The content accuracy of off-the-shelf large language models (LLMs) mirrors the content accuracy of the unregulated Internet from which these generative artificial intelligence models draw their training data. With error rates of approximately 30% in treatment recommendations for common musculoskeletal conditions, seeking expert opinion remains paramount. However, custom LLMs represent an excellent opportunity to infuse niche, bespoke expertise from the many specialties and subspecialties within medicine. Methods of customizing these generative models fall broadly into four categories: prompt engineering; "retrieval-augmented generation," which prioritizes retrieval of relevant information from a specific domain of data; "fine-tuning" of a basic pretrained model into one that is refined for health care-related vernacular and acronyms; and "agentic augmentation," in which software breaks down complex tasks into smaller ones, recruits multiple LLMs (with or without retrieval-augmented generation), optimizes the output, decides internally whether the response is appropriate or sufficient, and even passes an unmet outcome to a human for supervision ("phone a friend"). Custom LLMs offer physicians and their associated organizations the rare opportunity to regain control of our profession by re-establishing authority in our increasingly digital landscape.
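
To make the "agentic augmentation" idea concrete, the following is a minimal Python sketch, not an implementation from the commentary. It assumes a toy in-memory corpus, a stubbed model call, and a simple sufficiency rule; every name in it (`CORPUS`, `retrieve`, `stub_llm`, `is_sufficient`, `decompose`, `run_agent`) is hypothetical, and the corpus text is placeholder content rather than clinical guidance.

```python
from dataclasses import dataclass

# Hypothetical curated corpus standing in for a specialty-specific knowledge base.
# The passages are placeholders, not real guideline text.
CORPUS = {
    "ankle sprain": "Placeholder guideline passage about ankle sprain management.",
    "rotator cuff tear": "Placeholder guideline passage about rotator cuff tears.",
}


@dataclass
class Answer:
    question: str
    text: str
    sources: list
    escalated: bool  # True when the question was handed to a human


def decompose(task: str) -> list:
    """Toy task decomposition: split a compound request on ';'."""
    return [part.strip() for part in task.split(";") if part.strip()]


def retrieve(question: str) -> list:
    """Toy retrieval: return passages whose topic key appears in the question."""
    q = question.lower()
    return [passage for topic, passage in CORPUS.items() if topic in q]


def stub_llm(question: str, context: list) -> str:
    """Stand-in for a real model call; it only reports what it was given."""
    return f"Draft answer to '{question}' based on {len(context)} passage(s)."


def is_sufficient(context: list) -> bool:
    """Toy sufficiency check (an assumption): require at least one passage."""
    return len(context) > 0


def run_agent(task: str) -> list:
    """Decompose the task, answer each part with retrieval, escalate the rest."""
    answers = []
    for question in decompose(task):
        context = retrieve(question)
        if is_sufficient(context):
            text = stub_llm(question, context)
            answers.append(Answer(question, text, context, False))
        else:
            # "Phone a friend": no supporting material, so defer to a human.
            answers.append(Answer(question, "Referred to a clinician for review.", [], True))
    return answers


if __name__ == "__main__":
    task = "How is an ankle sprain managed?; What about a meniscus tear?"
    for answer in run_agent(task):
        print(answer.escalated, "-", answer.text)
```

In this sketch the first sub-question matches the corpus and is answered from retrieved context, while the second has no supporting passage and is escalated. A real system would replace the keyword lookup with a proper retrieval index and judge the generated response itself, not only the retrieved context.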

Source
http://dx.doi.org/10.1016/j.arthro.2024.09.047

Similar Publications

Lies are ubiquitous and often occur in social interactions. However, data on socially conducted deception are hard to obtain because people are unlikely to self-report their intentional deceptive behaviors, especially malicious ones. Social deduction games, a type of social game in which deception is a key gameplay mechanic, can be a good alternative for studying social deception.

Breast cancer is one of the most common malignant tumors in women worldwide. Although large language models (LLMs) can provide breast cancer nursing care consultation, inherent hallucinations can lead to inaccurate responses. Retrieval-augmented generation (RAG) technology can improve LLM performance, offering a new approach for clinical applications.

Large Language Models (LLMs) have gained significant popularity in recent years for specialized tasks using prompts due to their low computational cost. Standard methods like prefix tuning utilize special, modifiable tokens that lack semantic meaning and require extensive training for best performance, often falling short. In this context, we propose a novel method called Semantic Knowledge Tuning (SK-Tuning) for prompt and prefix tuning that employs meaningful words instead of random tokens.

Health event prediction is empowered by the rapid and wide application of electronic health records (EHR). In the Intensive Care Unit (ICU), precisely predicting health-related events in advance is essential for providing treatment and intervention to improve patient outcomes. EHR data are multimodal, containing clinical text, time series, structured data, etc.

Diagnosis, treatment, and prevention of ankle sprains: Comparing free chatbot recommendations with clinical guidelines.

Foot Ankle Surg

December 2024

Department of Trauma Surgery, Orthopaedics and Plastic Surgery, University of Göttingen, Robert-Koch-Str. 40, Göttingen 37075, Germany.

Background: Free chatbots powered by large language models offer treatment recommendations for lateral ankle sprains (LAS) but lack scientific validation.

Methods: The chatbots (Claude, Perplexity, and ChatGPT) were evaluated by comparing their responses to a questionnaire and their treatment algorithms against current clinical guidelines. Responses were graded on accuracy, conclusiveness, supplementary information, and incompleteness, and evaluated individually and collectively, with a 60% pass threshold.
