Background: In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of large language models in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records.

Objective: We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators.

Methods: We implemented a script using the OpenAI application programming interface (API) to extract structured information in JavaScript Object Notation (JSON) format from the comorbidities reported in 250 personal history reports. These reports were manually reviewed in batches of 50 by 5 specialists in radiation oncology. We compared the results using sensitivity, specificity, precision, accuracy, F-value, the κ index, and the McNemar test, and we also examined the common causes of errors for both the human reviewers and the generative pretrained transformer (GPT) models.
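
To illustrate the extraction step described above, the sketch below shows how such a script might call the OpenAI chat completions endpoint in JSON mode using the Python SDK (v1.x). The prompt wording, field names, and helper function are hypothetical assumptions for illustration and are not reproduced from the study.

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Hypothetical system prompt; the study's actual prompt is not reproduced here.
    SYSTEM_PROMPT = (
        "You are a clinical data abstractor. From the personal history report, "
        "return a JSON object with a key 'comorbidities' whose value is a list of "
        "objects, each with the fields 'name' and 'explicit'."
    )

    def extract_comorbidities(report_text, model="gpt-4-1106-preview"):
        """Send one report to the model and parse its structured JSON reply."""
        response = client.chat.completions.create(
            model=model,
            response_format={"type": "json_object"},  # JSON mode, supported by the -1106 models
            temperature=0,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": report_text},
            ],
        )
        return json.loads(response.choices[0].message.content)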

Results: The GPT-3.5 model performed slightly worse than physicians across all metrics, though the differences were not statistically significant (McNemar test, P=.79). GPT-4 demonstrated clear superiority in several key metrics (McNemar test, P<.001). Notably, it achieved a sensitivity of 96.8%, compared with 88.2% for GPT-3.5 and 88.8% for physicians. However, physicians marginally outperformed GPT-4 in precision (97.7% vs 96.8%). GPT-4 was also more consistent, reproducing identical results in 76% of the reports across 10 repeated analyses, compared with 59% for GPT-3.5, indicating more stable and reliable performance. Physicians were more likely to miss explicitly stated comorbidities, whereas the GPT models more frequently inferred comorbidities that were not explicit, sometimes correctly, though this also produced more false positives.
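
For readers reproducing the comparison, the minimal sketch below shows how the reported agreement statistics can be computed from paired binary labels with scikit-learn and statsmodels; the label arrays are placeholders for illustration only, not the study's data.

    from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                                 confusion_matrix, f1_score,
                                 precision_score, recall_score)
    from statsmodels.stats.contingency_tables import mcnemar

    # Placeholder binary labels (1 = comorbidity present); not the study's data.
    gold    = [1, 1, 0, 1, 0, 0, 1, 0]  # reference standard
    rater_a = [1, 0, 0, 1, 0, 1, 1, 0]  # e.g., a GPT model
    rater_b = [1, 1, 0, 0, 0, 0, 1, 1]  # e.g., a physician

    tn, fp, fn, tp = confusion_matrix(gold, rater_a).ravel()
    print("sensitivity:", recall_score(gold, rater_a))
    print("specificity:", tn / (tn + fp))
    print("precision:  ", precision_score(gold, rater_a))
    print("accuracy:   ", accuracy_score(gold, rater_a))
    print("F-value:    ", f1_score(gold, rater_a))
    print("kappa:      ", cohen_kappa_score(gold, rater_a))

    # McNemar test on the paired correctness of the two raters against the gold standard.
    correct_a = [int(g == r) for g, r in zip(gold, rater_a)]
    correct_b = [int(g == r) for g, r in zip(gold, rater_b)]
    print("McNemar P:  ", mcnemar(confusion_matrix(correct_a, correct_b), exact=True).pvalue)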

Conclusions: This study demonstrates that, with well-designed prompts, the large language models examined can match or even surpass medical specialists in extracting information from complex clinical reports. Their superior efficiency in time and costs, along with easy integration with databases, makes them a valuable tool for large-scale data mining and real-world evidence generation.


Source: http://dx.doi.org/10.2196/58457
