Evaluating large language models for criterion-based grading from agreement to consistency.

NPJ Sci Learn

Department of Psychology, Jeffrey Cheah School of Medicine and Health Sciences, Monash University Malaysia, Bandar Sunway, 47500, Malaysia.

Published: December 2024

This study evaluates the ability of large language models (LLMs) to deliver criterion-based grading and examines how prompt engineering with detailed criteria affects grading performance. Using well-established human benchmarks and quantitative analyses, we found that even free LLMs can achieve criterion-based grading when given a detailed specification of the criteria, underscoring the importance of domain-specific understanding over model complexity. These findings highlight the potential of LLMs to deliver scalable educational feedback.
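As a rough illustration of the kind of quantitative analysis the abstract describes, the sketch below compares LLM-assigned grades against a human benchmark (agreement) and against a repeated LLM run (consistency). The specific metrics (quadratic weighted kappa and Pearson correlation), the 0-10 rubric scale, and all grade values are assumptions for illustration only, not the authors' actual data or methodology.

    # Minimal sketch (not the authors' code): quantifying agreement and
    # consistency of LLM criterion-based grading against human benchmarks.
    from sklearn.metrics import cohen_kappa_score
    from scipy.stats import pearsonr

    # Hypothetical grades on an assumed 0-10 criterion-based rubric.
    human_grades    = [7, 5, 9, 6, 8, 4, 7, 10, 3, 6]
    llm_grades_run1 = [7, 6, 9, 6, 7, 4, 8, 9, 3, 6]
    llm_grades_run2 = [7, 5, 9, 7, 7, 4, 8, 9, 4, 6]

    # Agreement: how closely the LLM matches the human grader.
    agreement = cohen_kappa_score(human_grades, llm_grades_run1, weights="quadratic")

    # Consistency: how stable the LLM's grades are across repeated runs
    # on the same submissions with the same prompt.
    consistency, _ = pearsonr(llm_grades_run1, llm_grades_run2)

    print(f"Agreement with human grader (quadratic weighted kappa): {agreement:.2f}")
    print(f"Consistency across repeated LLM runs (Pearson r): {consistency:.2f}")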


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11683144
DOI: http://dx.doi.org/10.1038/s41539-024-00291-1

Publication Analysis

Top Keywords (frequency)

criterion-based grading   12
large language             8
language models            8
llms deliver               8
evaluating large           4
models criterion-based     4
grading                    4
grading agreement          4
agreement consistency      4
consistency study          4

