Background: Current interest in large language models (LLMs) is likely to increase their use for medical advice. Although LLMs offer considerable potential, they also pose misinformation hazards.
Objective: This study evaluates three LLMs answering urology-themed, case-based clinical questions by comparing the quality of their answers with those provided by urology consultants.
Methods: Forty-five case-based questions were answered by consultants and LLMs (ChatGPT 3.5, ChatGPT 4, Bard). Answers were rated blindly by four consultants on a six-step Likert scale in the categories 'medical adequacy', 'conciseness', 'coherence' and 'comprehensibility'. Possible misinformation hazards were identified, a modified Turing test was included, and the character count was compared.
Results: The consultants received higher ratings in every category. The LLMs' overall performance in the language-focused categories (coherence and comprehensibility) was relatively high, but their medical adequacy was significantly poorer than that of the consultants. Possible misinformation hazards were identified in 2.8% to 18.9% of LLM-generated answers compared with <1% of consultants' answers. The LLMs also gave less concise answers with higher character counts. Among the individual LLMs, ChatGPT 4 performed best in medical accuracy (P < 0.0001) and coherence (P = 0.001), whereas Bard received the lowest scores. Generated responses were correctly attributed to their source with 98% accuracy for the LLMs and 99% for the consultants.
Conclusions: The quality of the consultants' answers was superior to that of the LLMs in all categories. Although the LLM answers achieved high semantic scores, their lack of medical accuracy makes LLM 'consultations' a potential source of misinformation. Further investigation of newer LLM generations is necessary.
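For readers interested in how group differences in ordinal Likert ratings of this kind are commonly tested, the following is a minimal, illustrative Python sketch. The abstract does not state which statistical test the authors used; the Mann-Whitney U test and all rating values below are assumptions chosen purely for demonstration.

```python
# Illustrative sketch only: the abstract reports Likert ratings and P-values but does
# not specify the statistical test. A Mann-Whitney U test is one common choice for
# ordinal (Likert) data; all data below are invented for demonstration.
from scipy.stats import mannwhitneyu

# Hypothetical six-step Likert ratings (1 = worst, 6 = best) for 'medical adequacy',
# one value per rated answer.
consultant_ratings = [6, 5, 6, 6, 5, 6, 4, 5, 6, 5]
llm_ratings = [4, 3, 5, 2, 4, 3, 4, 2, 3, 4]

# Two-sided test of whether the two groups differ in their rating distributions.
statistic, p_value = mannwhitneyu(consultant_ratings, llm_ratings, alternative="two-sided")
print(f"U = {statistic:.1f}, P = {p_value:.4f}")
```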
Download full-text PDF | Source
---|---
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11090772 | PMC
http://dx.doi.org/10.1002/bco2.359 | DOI Listing
J Med Internet Res
January 2025
Indiana University, Indianapolis, IN, United States.
Background: Heart failure (HF) is one of the most common causes of hospital readmission in the United States. These hospitalizations are often driven by insufficient self-care. Commercial mobile health (mHealth) technologies, such as consumer-grade apps and wearable devices, offer opportunities for improving HF self-care, but their efficacy remains largely underexplored.
Top Cogn Sci
January 2025
Department of Anthropology, Indiana University.
Studies of the evolution of language rely heavily on comparisons to nonhuman primates, particularly the gestural communication of nonhuman apes. Differences between human and ape gestures are largely ones of degree rather than kind. For example, while human gestures are more flexible, ape gestures are not inflexible.
PLoS One
January 2025
Department of Computer Science, IT University of Copenhagen, Copenhagen, Denmark.
Engaging in the deliberate generation of abnormal outputs from Large Language Models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks, defining LLM red-teaming based on extensive and diverse evidence. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs to fail.
PLOS Digit Health
January 2025
Rwanda Ministry of Health, Kigali, Rwanda.
Community isolation of patients with communicable infectious diseases limits the spread of pathogens, but our understanding of isolated patients' needs and challenges is incomplete. Rwanda deployed a digital health service nationally to assist public health clinicians in remotely monitoring and supporting SARS-CoV-2 cases via their mobile phones using daily interactive short message service (SMS) check-ins. We aimed to assess texting patterns and the topics communicated to better understand patient experiences.
Hosp Pediatr
January 2025
Department of Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, Minnesota.
Background And Objectives: Some Minnesota clinicians perceive that the incidence of prophylactic vitamin K refusal is increasing, yet the actual incidence and the populations most likely to refuse are unknown. Our objective is to identify the incidence of vitamin K refusal and to characterize the maternal-newborn dyads with increased refusal rates.
Methods: This retrospective multi-institution study analyzed vitamin K refusal in newborns born from 2015 to 2019.