Document Type
Thesis
Degree
Master of Science
Major
Computer Science
Date of Defense
7-17-2025
Graduate Advisor
Badri Adhikari, Ph.D.
Committee
Azim Ahmadzadeh, Ph.D.
Sharlee Climer, Ph.D.
Mark Hauschild, Ph.D.
Abstract
Background: Automated essay scoring (AES) is a challenging deep learning problem. The two most widely used approaches to predicting essay quality scores, supervised learning-based and LLM-based, each have limitations. Supervised learning-based methods are more accurate, but they only predict a score and offer no descriptive feedback to students. LLM-based methods can offer rubric-guided feedback but are known to be less accurate.
Methods: This work focuses on improving the accuracy of state-of-the-art LLM-based AES methods. We first investigated why these methods perform poorly on certain datasets and certain examples, which led us to identify several limitations. We then tested several new prompting strategies intended to improve accuracy on these hard cases while maintaining accuracy on the others.
Results: Our new prompting strategies raised state-of-the-art essay scoring accuracy by 11%, with average quadratic weighted kappa (QWK) agreement scores increasing from 0.53 to 0.60. This improvement comes from enriching the context given to the LLM, reiterating critical instructions, and walking the LLM through the grading process. While the gain in accuracy suggests a promising direction for AES, the evaluation also revealed several tail risks that raise serious concerns about real-world deployment.
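For readers unfamiliar with the metric, QWK (quadratic weighted kappa) measures agreement between two sets of ordinal scores, penalizing each disagreement by the squared distance between the two scores. The following is a minimal Python sketch of the standard definition, not the evaluation code used in the thesis; the function name, its arguments, and the sample scores are illustrative assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa between two equal-length lists of integer scores.

    Illustrative sketch only: the name and signature are assumptions,
    not taken from the thesis.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")

    n = max_score - min_score + 1  # number of possible score categories
    if n < 2:
        return 1.0  # a single category leaves no room for disagreement
    total = len(rater_a)

    # Observed agreement matrix: observed[i][j] counts essays scored
    # category i by rater A and category j by rater B.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic penalty
            expected = hist_a[i] * hist_b[j] / total  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        return 1.0  # both raters used one identical score throughout
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on a 1-4 scale (not data from the thesis).
human_scores = [1, 2, 3, 4, 2, 3]
model_scores = [1, 2, 3, 3, 2, 4]
kappa = quadratic_weighted_kappa(human_scores, model_scores, 1, 4)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.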
Recommended Citation
Fink, Thomas A., "Limitations of Using Large Language Models for Automated Essay Scoring" (2025). Theses. 495.
https://irl.umsl.edu/thesis/495
Included in
Artificial Intelligence and Robotics Commons, Educational Assessment, Evaluation, and Research Commons, Educational Technology Commons