Evaluating large language models for criterion-based grading from agreement to consistency.

NPJ Sci Learn

Department of Psychology, Jeffrey Cheah School of Medicine and Health Sciences, Monash University Malaysia, Bandar Sunway, 47500, Malaysia.

Published: December 2024

This study evaluates the ability of large language models (LLMs) to deliver criterion-based grading and examines how prompt engineering with detailed criteria affects grading performance. Using well-established human benchmarks and quantitative analyses, we found that even free LLMs can achieve criterion-based grading when given a detailed specification of the criteria, underscoring the importance of domain-specific understanding over model complexity. These findings highlight the potential of LLMs to deliver scalable educational feedback.
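As a rough illustration of the kind of quantitative analysis the abstract describes, the sketch below compares LLM-assigned grades against a human benchmark (agreement) and against a repeated LLM run (consistency). The specific metrics (quadratic weighted kappa and Pearson correlation), the 0-10 rubric scale, and all grade values are assumptions for illustration only, not the authors' actual data or methodology.

    # Minimal sketch (not the authors' code): quantifying agreement and
    # consistency of LLM criterion-based grading against human benchmarks.
    from sklearn.metrics import cohen_kappa_score
    from scipy.stats import pearsonr

    # Hypothetical grades on an assumed 0-10 criterion-based rubric.
    human_grades    = [7, 5, 9, 6, 8, 4, 7, 10, 3, 6]
    llm_grades_run1 = [7, 6, 9, 6, 7, 4, 8, 9, 3, 6]
    llm_grades_run2 = [7, 5, 9, 7, 7, 4, 8, 9, 4, 6]

    # Agreement: how closely the LLM matches the human grader.
    agreement = cohen_kappa_score(human_grades, llm_grades_run1, weights="quadratic")

    # Consistency: how stable the LLM's grades are across repeated runs
    # on the same submissions with the same prompt.
    consistency, _ = pearsonr(llm_grades_run1, llm_grades_run2)

    print(f"Agreement with human grader (quadratic weighted kappa): {agreement:.2f}")
    print(f"Consistency across repeated LLM runs (Pearson r): {consistency:.2f}")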


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11683144
DOI: http://dx.doi.org/10.1038/s41539-024-00291-1

Publication Analysis

Top Keywords (frequency)

criterion-based grading   12
large language             8
language models            8
llms deliver               8
evaluating large           4
models criterion-based     4
grading                    4
grading agreement          4
agreement consistency      4
consistency study          4

