In best linear prediction (BLP), a true test score is predicted by observed item scores and by ancillary test data. If the use of BLP rather than a more direct estimate of a true score has disparate impact for different demographic groups, then a fairness issue arises. To improve population invariance but to preserve much of the efficiency of BLP, a modified approach, penalized best linear prediction, is proposed that weights both mean square error of prediction and a quadratic measure of subgroup biases.
View Article and Find Full Text PDFBr J Math Stat Psychol
May 2015
In many educational tests which involve constructed responses, a traditional test score is obtained by adding together item scores obtained through holistic scoring by trained human raters. For example, this practice was used until 2008 in the case of GRE(®) General Analytical Writing and until 2009 in the case of TOEFL(®) iBT Writing. With use of natural language processing, it is possible to obtain additional information concerning item responses from computer programs such as e-rater(®).
View Article and Find Full Text PDFResidual analysis (e.g. Hambleton & Swaminathan, Item response theory: principles and applications, Kluwer Academic, Boston, 1985; Hambleton, Swaminathan, & Rogers, Fundamentals of item response theory, Sage, Newbury Park, 1991) is a popular method to assess fit of item response theory (IRT) models.
View Article and Find Full Text PDFThis commentary addresses the modeling and final analytical path taken, as well as the terminology used, in the paper "Hierarchical diagnostic classification models: a family of models for estimating and testing attribute hierarchies" by Templin and Bradshaw (Psychometrika, doi: 10.1007/s11336-013-9362-0, 2013). It raises several issues concerning use of cognitive diagnostic models that either assume attribute hierarchies or assume a certain form of attribute interactions.
View Article and Find Full Text PDFMonitoring a very frequently administered educational test with a relatively short history of stable operation imposes a number of challenges. Test scores usually vary by season, and the frequency of administration of such educational tests is also seasonal. Although it is important to react to unreasonable changes in the distributions of test scores in a timely fashion, it is not a simple matter to ascertain what sort of distribution is really unusual.
View Article and Find Full Text PDFBr J Math Stat Psychol
November 2013
Haberman (2008) suggested a method to determine if subtest scores have added value over the total score. The method is based on classical test theory and considers the estimation of the true subscores. Performance of subgroups, for example, those based on gender or ethnicity, on subtests is often of interest.
View Article and Find Full Text PDFDiagnostic scores are of increasing interest in educational testing due to their potential remedial and instructional benefit. Naturally, the number of educational tests that report diagnostic scores is on the rise, as are the number of research publications on such scores. This article provides a critical evaluation of diagnostic score reporting in educational testing.
View Article and Find Full Text PDFWhen a simple random sample of size n is employed to establish a classification rule for prediction of a polytomous variable by an independent variable, the best achievable rate of misclassification is higher than the corresponding best achievable rate if the conditional probability distribution is known for the predicted variable given the independent variable. In typical cases, this increased misclassification rate due to sampling is remarkably small relative to other increases in expected measures of prediction accuracy due to samplings that are typically encountered in statistical analysis.This issue is particularly striking if a polytomous variable predicts a polytomous variable, for the excess misclassification rate due to estimation approaches 0 at an exponential rate as n increases.
View Article and Find Full Text PDF