Evaluating the ChatGPT family of models for biomedical reasoning and classification.

J Am Med Inform Assoc

Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, MA 02115, United States.

Published: April 2024

Objective: Large language models (LLMs) have shown impressive ability in biomedical question answering, but have not been adequately investigated for more specific biomedical applications. This study investigates the ChatGPT family of models (GPT-3.5, GPT-4) on biomedical tasks beyond question answering.

Materials and Methods: We evaluated model performance with 11 122 samples for two fundamental tasks in the biomedical domain: classification (n = 8676) and reasoning (n = 2446). The first task involves classifying health advice in scientific literature, while the second task is detecting causal relations in biomedical literature. We used 20% of the dataset for prompt development, including zero- and few-shot settings with and without chain-of-thought (CoT). We then evaluated the best prompts from each setting on the remaining dataset, comparing them to models using simple features (BoW with logistic regression) and fine-tuned BioBERT models.
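For readers unfamiliar with the simple baseline referenced above, the following is a minimal, illustrative sketch of a bag-of-words classifier with logistic regression, assuming a scikit-learn pipeline and hypothetical toy sentences for the health-advice classification task; the abstract does not specify the authors' exact preprocessing, features, or hyperparameters.

```python
# Minimal sketch of a BoW + logistic regression baseline (illustrative only;
# not the authors' exact implementation or data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy examples: 1 = sentence contains health advice, 0 = it does not.
train_texts = [
    "Patients with hypertension should reduce sodium intake.",
    "The cohort included 1200 participants recruited in 2015.",
]
train_labels = [1, 0]
test_texts = ["Clinicians should consider screening adults over 50."]
test_labels = [1]

# Bag-of-words features fed into a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
print("Macro F1:", f1_score(test_labels, predictions, average="macro"))
```

A pipeline of this kind trains in seconds on CPU, which is consistent with the large gap in wall-clock cost between the BoW baseline and LLM prompting reported in the Results.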

Results: Fine-tuning BioBERT produced the best classification (F1: 0.800-0.902) and reasoning (F1: 0.851) results. Among LLM approaches, few-shot CoT achieved the best classification (F1: 0.671-0.770) and reasoning (F1: 0.682) results, comparable to the BoW model (F1: 0.602-0.753 and 0.675 for classification and reasoning, respectively). It took 78 h to obtain the best LLM results, compared to 0.078 and 0.008 h for the top-performing BioBERT and BoW models, respectively.

Discussion: The simple BoW model performed comparably to the most complex LLM prompting strategy, and prompt engineering required a substantial time investment.

Conclusion: Despite the excitement around ChatGPT, fine-tuning remained the best strategy for these two fundamental biomedical natural language processing tasks.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10990500
DOI: http://dx.doi.org/10.1093/jamia/ocad256

