AI Article Synopsis

  • The text discusses the growing interest in large language models (LLMs) in medical contexts, emphasizing the need to evaluate their accuracy on healthcare examinations relative to established human performance standards.
  • A systematic review of the literature up to September 2023 was conducted, analyzing the accuracy of LLMs in responding to healthcare examination questions, with strict inclusion criteria to ensure relevant and original research.
  • The findings indicated that LLMs had an overall medical exam accuracy of 0.61 and a USMLE accuracy of 0.51, highlighting a significant gap compared to human performance metrics in medical assessments.

Article Abstract

Background: Large language models (LLMs) have dominated public interest due to their apparent capability to accurately replicate learned knowledge in narrative text. However, there is a lack of clarity about the accuracy and capability standards of LLMs in health care examinations.

Objective: We conducted a systematic review of LLM accuracy, as tested under health care examination conditions, as compared to known human performance standards.

Methods: We quantified the accuracy of LLMs in responding to health care examination questions and evaluated the consistency and quality of study reporting. The search included all papers published up until September 10, 2023, covering all LLMs reported in English-language journals with clearly stated accuracy standards. The exclusion criteria were as follows: the assessment was not a health care exam, no LLM was evaluated, no comparable success accuracy was reported, or the literature was not original research. The literature search included the following Medical Subject Headings (MeSH) terms, used in all possible combinations: "artificial intelligence," "ChatGPT," "GPT," "LLM," "large language model," "machine learning," "neural network," "Generative Pre-trained Transformer," "Generative Transformer," "Generative Language Model," "Generative Model," "medical exam," "healthcare exam," and "clinical exam." Sensitivity, accuracy, and precision data were extracted, including relevant CIs.

Results: The search identified 1673 relevant citations. After removing duplicate results, 1268 (75.8%) papers were screened for titles and abstracts, and 32 (2.5%) studies were included for full-text review. Our meta-analysis suggested that LLMs are able to perform with an overall medical examination accuracy of 0.61 (CI 0.58-0.64) and a United States Medical Licensing Examination (USMLE) accuracy of 0.51 (CI 0.46-0.56), while Chat Generative Pretrained Transformer (ChatGPT) can perform with an overall medical examination accuracy of 0.64 (CI 0.6-0.67).
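The pooled accuracies and 95% CIs reported above come from the review's meta-analysis; as a minimal illustrative sketch (assuming a simple Wilson score interval rather than the review's actual meta-analytic pooling method, and using hypothetical counts), an accuracy proportion with its CI can be computed from counts of correct answers:

```python
import math

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% CI for a proportion (e.g., exam accuracy)."""
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total                      # adjustment factor
    center = (p + z2 / (2 * total)) / denom     # shrunken point estimate
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    return center - half, center + half

# Hypothetical example: 61 correct answers out of 100 questions.
lo, hi = wilson_ci(61, 100)
print(f"accuracy 0.61, 95% CI {lo:.2f}-{hi:.2f}")
```

Note that the width of such an interval shrinks with the number of questions pooled, which is why aggregating across 32 studies yields the relatively narrow CIs reported.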

Conclusions: LLMs offer promise to remediate health care demand and staffing challenges by providing accurate and efficient context-specific information to critical decision makers. For policy and deployment decisions about LLMs to advance health care, we proposed a new framework called RUBRICC (Regulatory, Usability, Bias, Reliability [Evidence and Safety], Interoperability, Cost, and Codesign-Patient and Public Involvement and Engagement [PPIE]). This presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services, while respecting patient safety considerations.

Trial Registration: OSF Registries osf.io/xqzkw; https://osf.io/xqzkw.


Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11576595 (PMC)
http://dx.doi.org/10.2196/56532 (DOI Listing)

Publication Analysis

Top Keywords

health care (28)
accuracy (10)
accuracy capability (8)
health (8)
systematic review (8)
review meta-analysis (8)
llm accuracy (8)
care examination (8)
search included (8)
language model (8)

