Sensitivity and Specificity of Using GPT-3.5 Turbo Models for Title and Abstract Screening in Systematic Reviews and Meta-analyses.

Ann Intern Med

Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris, France; Centre d'Epidémiologie Clinique, Hôpital Hôtel-Dieu, AP-HP, Paris, France; and Department of Epidemiology, Columbia University Mailman School of Public Health, New York, New York (P.R.).

Published: June 2024

Background: Systematic reviews are performed manually despite the exponential growth of scientific literature.

Objective: To investigate the sensitivity and specificity of GPT-3.5 Turbo, from OpenAI, as a single reviewer, for title and abstract screening in systematic reviews.

Design: Diagnostic test accuracy study.

Setting: Unannotated bibliographic databases from 5 systematic reviews representing 22 665 citations.

Participants: None.

Measurements: A generic prompt framework to instruct GPT to perform title and abstract screening was designed. The output of the model was compared with decisions from authors under 2 rules. The first rule balanced sensitivity and specificity, for example, to act as a second reviewer. The second rule optimized sensitivity, for example, to reduce the number of citations to be manually screened.
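How the two rules turn model output into screening decisions can be sketched as follows. The abstract does not say what form the model output takes or how each rule maps it to a decision, so everything here is an assumption made for illustration: the 1-to-5 integer score, the threshold values, and the names `THRESHOLDS`, `decide`, and `tally` are all hypothetical. The sketch only shows why a lower cutoff trades specificity for sensitivity.

```python
from typing import Dict, Iterable, Tuple

# ASSUMPTION: the model returns an integer inclusion score from 1 (exclude)
# to 5 (include). The thresholds are illustrative, not taken from the study.
THRESHOLDS: Dict[str, int] = {
    "balanced": 3,   # hypothetical cutoff for acting as a second reviewer
    "sensitive": 2,  # hypothetical lower cutoff that rules out fewer citations
}


def decide(score: int, rule: str) -> bool:
    """Return True when the citation is kept for further screening."""
    return score >= THRESHOLDS[rule]


def tally(pairs: Iterable[Tuple[int, bool]], rule: str) -> Dict[str, int]:
    """Count TP, FP, TN, FN given (model score, author decision) pairs."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, author_included in pairs:
        kept = decide(score, rule)
        if kept and author_included:
            counts["tp"] += 1
        elif kept and not author_included:
            counts["fp"] += 1
        elif not kept and author_included:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Made-up toy data: (model score, author decision).
toy = [(5, True), (2, True), (4, False), (2, False), (1, False)]
balanced_counts = tally(toy, "balanced")
sensitive_counts = tally(toy, "sensitive")
```

On this toy data the sensitive rule keeps the relevant citation scored 2 that the balanced rule drops, and it also keeps one more irrelevant citation.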

Results: Under the balanced rule, sensitivities ranged from 81.1% to 96.5% and specificities ranged from 25.8% to 80.4%. Across all reviews, GPT identified 7 of 708 citations (1%) missed by humans that should have been included after full-text screening, at the cost of 10 279 of 22 665 false-positive recommendations (45.3%) that would require reconciliation during the screening process. Under the sensitive rule, sensitivities ranged from 94.6% to 99.8% and specificities ranged from 2.2% to 46.6%. Limiting manual screening to citations not ruled out by GPT could reduce the number of citations to screen by between 127 of 6334 (2%) and 1851 of 4077 (45.4%), depending on the review, at the cost of missing from 0 to 1 of 26 citations (3.8%) at the full-text level.
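The quantities in the results are simple ratios. The sketch below uses the standard definitions of sensitivity and specificity and recomputes a few of the reported percentages from the counts given above. The helper names (`sensitivity`, `specificity`, `percent`) are illustrative, and the confusion-matrix counts in the first example are made up, because the abstract does not report per-review counts.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Proportion of truly relevant citations that the screener kept."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Proportion of truly irrelevant citations that the screener ruled out."""
    return tn / (tn + fp)


def percent(part: int, whole: int, digits: int = 1) -> float:
    """Express part/whole as a percentage rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


# Hypothetical counts for one review (not from the study).
example_sensitivity = sensitivity(tp=90, fn=10)    # 0.9
example_specificity = specificity(tn=600, fp=300)  # about 0.667

# Counts quoted in the abstract.
smallest_reduction = percent(127, 6334)   # fewest citations ruled out
largest_reduction = percent(1851, 4077)   # most citations ruled out
missed_full_text = percent(1, 26)         # worst case missed at full text
```

A workload reduction is the share of citations ruled out by the model, so it rises with specificity, while the share missed at full-text level rises as sensitivity falls.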

Limitations: Time needed to fine-tune the prompt, retrospective nature of the study, convenience sample of 5 systematic reviews, and GPT performance sensitive to prompt development and time.

Conclusion: The GPT-3.5 Turbo model may be used as a second reviewer for title and abstract screening, at the cost of additional work to reconcile added false positives. It also showed potential to reduce the number of citations before screening by humans, at the cost of missing some citations at the full-text level.

Primary Funding Source: None.

Source
http://dx.doi.org/10.7326/M23-3389
