- Recent advancements in large language models (LLMs), such as GPT-3.5 and ChatGPT, show promise on zero- and few-shot tasks, including medical evidence summarization across various clinical areas.
- This study evaluates these models with both automatic and human assessments and finds that automatic metrics do not always reflect the true quality of the generated summaries (see the sketch after this list).
- The models can produce summaries that are factually inconsistent with the source, make dubious or vague statements, and struggle with longer texts, raising concerns about the risk of misinformation in high-stakes medical settings.
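A minimal, hypothetical sketch of why surface-overlap metrics can diverge from human judgments of quality: n-gram metrics in the ROUGE family score lexical overlap between a candidate summary and a reference, so a candidate that contradicts the reference on a single word can still score highly. The `rouge1_f1` function and the example sentences below are illustrative assumptions, not taken from the study.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1-style F1: unigram overlap between a reference and a candidate summary."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    # Clipped unigram overlap (each token counted at most as often as it appears in both).
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A factually contradictory summary can still score very well if it reuses the reference wording.
reference = "The drug reduced mortality in the treatment group."
candidate = "The drug increased mortality in the treatment group."
print(round(rouge1_f1(reference, candidate), 2))  # high score despite the contradiction
```

This insensitivity to factual consistency is one reason the study pairs automatic metrics with human evaluation.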
The study examined electrocardiographic characteristics and COVID-19-related mortality in hospitalized patients at three New York City hospitals, with a focus on racial and ethnic minority patients.
Of 1,258 screened patients, 133 died; among those who died, 55.6% were male, 69.9% were from minority backgrounds, and most had cardiovascular conditions.
Arrhythmic deaths were linked to factors such as age, coronary artery disease, asthma, and specific electrocardiographic abnormalities, and many patients received only comfort care before death.