Generative Large Language Models for Detection of Speech Recognition Errors in Radiology Reports.

Reuben A Schmidt, Jarrel C Y Seah, Ke Cao, Lincoln Lim, Wei Lim, Justin Yeung

Radiol Artif Intell

From the Department of Medical Imaging, Western Health, Footscray, Australia (R.A.S., L.L., W.L.); Alfred Health, Harrison.ai, Monash University, Clayton, Australia (J.C.Y.S.); Department of Surgery, Western Precinct, University of Melbourne, Melbourne, Australia (K.C., J.Y.); and Department of Surgery, Western Health, Melbourne, Australia (J.Y.).

Published: March 2024

This study assessed how well five generative large language models (LLMs) identify speech recognition errors in a dataset of 3233 CT and MRI radiology reports.
GPT-4 performed best, with F1 scores of 86.9% for clinically significant errors and 94.3% for not clinically significant errors, ahead of the other models, including GPT-3.5-turbo and Bard.
Longer reports, resident dictation, and overnight shifts were associated with higher error rates, and the results suggest that advanced LLMs could be used for automated error detection in radiology reports.

This study evaluated the ability of generative large language models (LLMs) to detect speech recognition errors in radiology reports. A dataset of 3233 CT and MRI reports was assessed by radiologists for speech recognition errors. Errors were categorized as clinically significant or not clinically significant. Performances of five generative LLMs (GPT-3.5-turbo, GPT-4, text-davinci-003, Llama-v2-70B-chat, and Bard) were compared in detecting these errors, using manual error detection as the reference standard. Prompt engineering was used to optimize model performance. GPT-4 demonstrated high accuracy in detecting clinically significant errors (precision, 76.9%; recall, 100%; F1 score, 86.9%) and not clinically significant errors (precision, 93.9%; recall, 94.7%; F1 score, 94.3%). Text-davinci-003 achieved F1 scores of 72% and 46.6% for clinically significant and not clinically significant errors, respectively. GPT-3.5-turbo obtained 59.1% and 32.2% F1 scores, while Llama-v2-70B-chat scored 72.8% and 47.7%. Bard showed the lowest accuracy, with F1 scores of 47.5% and 20.9%. GPT-4 effectively identified challenging errors of nonsense phrases and internally inconsistent statements. Longer reports, resident dictation, and overnight shifts were associated with higher error rates. In conclusion, advanced generative LLMs show potential for automatic detection of speech recognition errors in radiology reports.

Keywords: CT, Large Language Model, Machine Learning, MRI, Natural Language Processing, Radiology Reports, Speech, Unsupervised Learning
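The F1 scores quoted in the abstract can be related to the quoted precision and recall through the standard definition of F1 as their harmonic mean. The sketch below is only an arithmetic check on the GPT-4 figures. The function name `f1_score` is a hypothetical helper, and nothing here reflects how the authors computed their metrics. Because the inputs are already rounded to one decimal place, agreement is only expected to within rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper).

    Works on percentages or fractions, as long as both inputs use the same scale.
    """
    if precision + recall == 0:
        return 0.0  # conventional value when both precision and recall are zero
    return 2 * precision * recall / (precision + recall)


# GPT-4 figures as reported in the abstract, in percent:
# (precision, recall, reported F1)
reported = {
    "clinically significant": (76.9, 100.0, 86.9),
    "not clinically significant": (93.9, 94.7, 94.3),
}

for label, (precision, recall, reported_f1) in reported.items():
    computed = f1_score(precision, recall)
    print(f"{label}: computed F1 = {computed:.1f}%, reported F1 = {reported_f1}%")
```

For the clinically significant category, perfect recall paired with 76.9% precision means every such error was flagged, at the cost of some false alarms; the harmonic mean sits between the two values and closer to the lower one.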

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10982816
DOI: http://dx.doi.org/10.1148/ryai.230205
