Large language models vs human for classifying clinical documents.

Int J Med Inform

College of Science and Engineering, James Cook University, Townsville, 4811, QLD, Australia.

Published: January 2025

Background: Accurate classification of medical records is crucial for clinical documentation, particularly when using the 10th revision of the International Classification of Diseases (ICD-10) coding system. The use of machine learning algorithms and Systematized Nomenclature of Medicine (SNOMED) mapping has shown promise in performing these classifications. However, challenges remain, particularly in reducing false negatives, where certain diagnoses are not correctly identified by either approach.

Objective: This study explores the potential of advanced large language models to improve the accuracy of ICD-10 classification in challenging medical records where both machine learning and SNOMED mapping fail.

Methods: We evaluated the performance of ChatGPT 3.5 and ChatGPT 4 in classifying ICD-10 codes from discharge summaries within selected records of the Medical Information Mart for Intensive Care (MIMIC) IV dataset. These records comprised 802 discharge summaries identified as false negatives by both machine learning and SNOMED mapping methods, underscoring their difficulty. Each summary was assessed by ChatGPT 3.5 and ChatGPT 4 using a classification prompt, and the results were compared with human coder evaluations. Five human coders, with a combined experience of over 30 years, independently classified a stratified sample of 100 summaries to validate ChatGPT's performance.
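The classification setup described above can be sketched as a short script. The prompt wording, the `build_prompt` and `classify_icd10` helper names, and the model identifier are illustrative assumptions, not the study's exact protocol:

```python
def build_prompt(discharge_summary: str) -> str:
    """Build a classification prompt asking the model for ICD-10 codes.

    The wording here is a hypothetical reconstruction, not the study's
    exact prompt.
    """
    return (
        "You are a clinical coder. Read the discharge summary below and "
        "return the most appropriate ICD-10 diagnosis codes, one per line.\n\n"
        f"Discharge summary:\n{discharge_summary}"
    )


def classify_icd10(discharge_summary: str, model: str = "gpt-4") -> str:
    """Send the summary to the OpenAI Chat Completions API.

    Requires the `openai` package and an API key in the environment;
    the model name is an assumption.
    """
    from openai import OpenAI  # imported lazily so the prompt helper stays standalone

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(discharge_summary)}],
    )
    return response.choices[0].message.content
```

Repeating `classify_icd10` over the same summaries in separate runs is what allows the run-to-run consistency comparison reported in the Results.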

Results: ChatGPT 4 demonstrated significantly improved consistency over ChatGPT 3.5, with matching results between runs ranging from 86% to 89%, compared with 57% to 67% for ChatGPT 3.5. The classification accuracy of ChatGPT 4 varied across ICD-10 codes. Overall, human coders performed better than ChatGPT, although ChatGPT 4 matched the median human coder, achieving an accuracy of 22% on these challenging cases.
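The run-to-run consistency figures above (e.g., 86% to 89% for ChatGPT 4) correspond to a simple agreement rate between two classification runs over the same summaries. A minimal sketch, using hypothetical code assignments:

```python
def run_agreement(run_a, run_b):
    """Fraction of summaries assigned the same ICD-10 code in two runs."""
    if len(run_a) != len(run_b):
        raise ValueError("runs must cover the same summaries")
    matches = sum(a == b for a, b in zip(run_a, run_b))
    return matches / len(run_a)


# Hypothetical codes from two runs over four summaries;
# the runs disagree only on the third summary.
first_run = ["I10", "E11.9", "J44.1", "N17.9"]
second_run = ["I10", "E11.9", "J18.9", "N17.9"]
print(run_agreement(first_run, second_run))  # 0.75
```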

Conclusion: This study underscores the potential of integrating advanced language models with clinical coding processes to improve documentation accuracy. ChatGPT 4 demonstrated improved consistency and comparable performance to median human coders, achieving 22% accuracy in challenging cases. Combining ChatGPT with methods like SNOMED mapping could further enhance clinical coding accuracy, particularly for complex scenarios.

Source: http://dx.doi.org/10.1016/j.ijmedinf.2025.105800

Publication Analysis

Top Keywords

snomed mapping (16), human coders (16), language models (12), machine learning (12), chatgpt (11), large language (8), medical records (8), false negatives (8), challenging cases (8), learning snomed (8)

Similar Publications


Background And Objective: Despite significant investments in the normalization and the standardization of Electronic Health Records (EHRs), free text is still the rule rather than the exception in clinical notes. The use of free text has implications in data reuse methods used for supporting clinical research since the query mechanisms used in cohort definition and patient matching are mainly based on structured data and clinical terminologies. This study aims to develop a method for the secondary use of clinical text by: (a) using Natural Language Processing (NLP) for tagging clinical notes with biomedical terminology; and (b) designing an ontology that maps and classifies all the identified tags to various terminologies and allows for running phenotyping queries.


A pipeline for harmonising NHS Scotland laboratory data to enable national-level analyses.

J Biomed Inform

January 2025

Health Informatics Centre, School of Medicine, University of Dundee, UK; Health Data Research UK, London, UK.

Objective: Medical laboratory data, together with prescribing and hospitalisation records, are three of the most used electronic health records (EHRs) for data-driven health research. In Scotland, hospitalisation, prescribing and death register data are available nationally, whereas laboratory data are captured, stored and reported from local health board systems with significant heterogeneity. For researchers or other users of this regionally curated data, working with laboratory datasets across regional cohorts requires effort and time.


[Establishment of a German ICCR dataset: Translation and integration of SNOMED CT using the example of TUR-B].

Pathologie (Heidelb)

December 2024

Institute of Pathology, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074 Aachen, Germany.

Background: The structured recording of data from histopathological findings and their interoperability is critical for quality assurance in pathology.

Materials And Methods: To harmonize the content of the reports, the International Collaboration on Cancer Reporting (ICCR) has defined standardized datasets. German-language versions of these datasets are not yet available nationwide.


Background: The digitisation of healthcare records has generated vast amounts of unstructured data, presenting opportunities for improvements in disease diagnosis when clinical coding falls short, such as in the recording of patient symptoms. This study presents an approach using natural language processing to extract clinical concepts from free-text which are used to automatically form diagnostic criteria for lung cancer from unstructured secondary-care data.

Methods: Patients aged 40 and above who underwent a chest x-ray (CXR) between 2016 and 2022 were included.

