Importance: Large language models (LLMs) are increasingly applied to medical tasks, so ensuring their reliability is vital to avoid erroneous results. Our study assesses two state-of-the-art LLMs (ChatGPT and LlaMA-2) for extracting clinical information, focusing on cognitive tests such as the MMSE and CDR.

Objective: To evaluate the performance of ChatGPT and LlaMA-2 in extracting MMSE and CDR scores and their associated dates.
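
To make the extraction task concrete, the snippet below is a minimal, hypothetical sketch of how a clinical note could be sent to an LLM to pull out MMSE/CDR scores and dates. The model name, prompt wording, and JSON output format are illustrative assumptions, not the study's actual prompt or pipeline.

```python
# Hypothetical sketch only -- NOT the study's actual prompt or pipeline.
# Assumes the openai Python package (>= 1.0) and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You extract cognitive test results from clinical notes. "
    "Return a JSON list of objects, each with keys "
    '"test" ("MMSE" or "CDR"), "score", and "date" (YYYY-MM-DD, or null if no date is given). '
    "If no MMSE or CDR result is present, return an empty list."
)

def extract_cognitive_scores(note_text: str, model: str = "gpt-4") -> list[dict]:
    """Ask the model for MMSE/CDR scores and their dates mentioned in one clinical note."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep extraction output as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": note_text},
        ],
    )
    # A real pipeline would validate or repair the model output before parsing.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    demo_note = "Seen 03/14/2019. MMSE today 24/30; CDR 0.5 from prior visit."
    print(extract_cognitive_scores(demo_note))
```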

Methods: Our data consisted of 135,307 clinical notes (January 12, 2010, to May 24, 2023) mentioning MMSE, CDR, or MoCA. After applying inclusion criteria, 34,465 notes remained, of which 765 were processed with ChatGPT (GPT-4) and LlaMA-2, and 22 experts reviewed the responses. ChatGPT successfully extracted MMSE and CDR instances with dates from 742 notes. We used 20 notes for fine-tuning and for training the reviewers. The remaining 722 notes were assigned to reviewers, 309 of which were assigned to two reviewers each. Inter-rater agreement (Fleiss' kappa), precision, recall, true/false negative rates, and accuracy were calculated. Our study follows the TRIPOD reporting guidelines for model validation.
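
The metrics named above reduce to simple arithmetic over the reviewer judgments. Below is a minimal sketch, assuming each note carries a binary gold label and a binary model label, and that Fleiss' kappa is computed over the subset of notes rated by two reviewers; the variable names and toy data are illustrative, not the study's data.

```python
# Minimal sketch of the evaluation metrics, under the assumptions stated above.
from collections import Counter

def confusion_metrics(gold: list[int], pred: list[int]) -> dict[str, float]:
    """Precision, recall (sensitivity), true-negative rate, and accuracy from binary labels."""
    pairs = Counter(zip(gold, pred))
    tp, fp = pairs[(1, 1)], pairs[(0, 1)]
    tn, fn = pairs[(0, 0)], pairs[(1, 0)]
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),              # sensitivity
        "tnr": tn / (tn + fp),                 # true-negative rate (specificity)
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

def fleiss_kappa(ratings: list[list[int]], n_categories: int = 2) -> float:
    """Fleiss' kappa for N subjects, each rated by the same number of raters.

    ratings[i][j] is the category (0..n_categories-1) that rater j gave subject i.
    """
    n_raters = len(ratings[0])
    # n_ij: how many raters put subject i into category j
    counts = [[row.count(c) for c in range(n_categories)] for row in ratings]
    # Observed agreement per subject, then averaged
    p_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1)) for row in counts]
    p_bar = sum(p_i) / len(p_i)
    # Expected agreement from marginal category proportions
    totals = [sum(row[c] for row in counts) for c in range(n_categories)]
    p_j = [t / (len(ratings) * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

if __name__ == "__main__":
    gold = [1, 1, 0, 1, 0, 0, 1]
    pred = [1, 0, 0, 1, 1, 0, 1]
    print(confusion_metrics(gold, pred))
    # Two reviewers rating the same five notes (e.g., correct = 1 / incorrect = 0)
    print(fleiss_kappa([[1, 1], [0, 1], [1, 1], [0, 0], [1, 0]]))
```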

Results: For MMSE information extraction, ChatGPT (vs. LlaMA-2) achieved accuracy of 83% (vs. 66.4%), sensitivity of 89.7% (vs. 69.9%), a true-negative rate of 96% (vs. 60.0%), and precision of 82.7% (vs. 62.2%). For CDR information extraction, sensitivity and precision were lower, with accuracy of 87.1% (vs. 74.5%), sensitivity of 84.3% (vs. 39.7%), a true-negative rate of 99.8% (vs. 98.4%), and precision of 48.3% (vs. 16.1%). We qualitatively evaluated the MMSE errors of ChatGPT and LlaMA-2 on the double-reviewed notes. LlaMA-2's errors included 27 cases of total hallucination, 19 cases of reporting another test's score instead of the MMSE, 25 missed scores, and 23 cases in which only the date was wrong. In comparison, ChatGPT's errors included only 3 cases of total hallucination, 17 cases of reporting another test's score instead of the MMSE, and 19 cases of reporting a wrong date.

Conclusions: In this diagnostic/prognostic study of ChatGPT and LlaMA-2 for extracting cognitive exam dates and scores from clinical notes, ChatGPT exhibited high accuracy and outperformed LlaMA-2. LLMs could benefit dementia research and clinical care by identifying patients eligible for treatment initiation or clinical trial enrollment. Rigorous evaluation of LLMs remains crucial to understanding their capabilities and limitations.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10888985
DOI: http://dx.doi.org/10.1101/2023.07.10.23292373

Similar Publications

Objective: Evaluate the accuracy and reliability of various generative artificial intelligence (AI) models (ChatGPT-3.5, ChatGPT-4.0, T5, Llama-2, Mistral-Large, and Claude-3 Opus) in predicting Emergency Severity Index (ESI) levels for pediatric emergency department patients and assess the impact of medically oriented fine-tuning.

Large language models (LLMs) offer potential in primary dental care. We conducted an evaluation of LLMs' diagnostic capabilities across various oral diseases and contexts. All LLMs showed diagnostic capabilities for temporomandibular joint disorders, periodontal disease, dental caries, and malocclusion.

Assessment of Large Language Models in Cataract Care Information Provision: A Quantitative Comparison.

Ophthalmol Ther

January 2025

Eye Center, The Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, 310009, China.

Introduction: Cataracts are a significant cause of blindness. While individuals frequently turn to the Internet for medical advice, distinguishing reliable information can be challenging. Large language models (LLMs) have attracted attention for generating accurate, human-like responses that may be used for medical consultation.

Article Synopsis
  • Imaging utilization is rising, but many studies may not be suitable for specific clinical situations, prompting a need for better guidance.
  • This research examines the effectiveness of eight popular large language models (LLMs) in providing imaging recommendations for 24 common neuroradiology scenarios, as graded by expert neuroradiologists.
  • GPT-4 and ChatGPT performed the best, with GPT-4 achieving optimal recommendations in 23 out of 24 cases, while models like Llama 2 lagged significantly behind in accuracy and usefulness.