Purpose: To analyze the accuracy and completeness of information produced by three large language models (LLMs) for providers about immune checkpoint inhibitor ocular toxicities.

Methods: Eight questions were created about the general definition of checkpoint inhibitors, their mechanism of action, ocular toxicities, and toxicity management. All were entered into ChatGPT 4.0, Bard, and LLaMA. Four ophthalmologists who routinely treat ocular toxicities of immunotherapy agents rated the LLMs' answers for accuracy and completeness on a six-point Likert scale. Analysis of variance, followed by post hoc pairwise t-tests, was used to assess significant differences among the three LLMs. Fleiss kappa values were calculated to assess interrater agreement.
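The Fleiss kappa statistic named in the Methods can be sketched as follows. This is a minimal illustration of the standard formula, not the authors' analysis code: the abstract does not say how ratings were tabulated, so the layout here (one row per rated response, one Likert score per rater) is an assumption, and the function name `fleiss_kappa` and the toy ratings are hypothetical, not study data.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per subject.

    ratings    -- list of subjects; each subject is a list holding one
                  category label per rater (assumed layout, see lead-in)
    categories -- all possible category labels, e.g. range(1, 7) for a
                  six-point Likert scale
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who put subject i in category j
    counts = []
    for row in ratings:
        tally = Counter(row)
        counts.append([tally[c] for c in categories])

    # Overall proportion of all assignments that fell in each category
    total = n_subjects * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]

    # Observed agreement per subject, then averaged over subjects
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Agreement expected by chance
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")

    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy ratings: 2 responses, 4 raters, six-point scale
likert = range(1, 7)
full_agreement = [[5, 5, 5, 5], [4, 4, 4, 4]]
split_raters = [[5, 4, 5, 4], [5, 4, 5, 4]]
kappa_full = fleiss_kappa(full_agreement, likert)
kappa_split = fleiss_kappa(split_raters, likert)
```

In the toy data, complete agreement yields a kappa of 1, while raters who split evenly on every response yield a negative kappa, meaning agreement below what chance alone would produce. Values near zero indicate agreement no better than chance.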

Results: ChatGPT responses received mean ratings of 4.59 for accuracy and 4.09 for completeness; Bard responses were rated 4.59 and 4.19; LLaMA responses were rated 4.38 and 4.03. The three LLMs did not differ significantly in accuracy (P = 0.47) or completeness (P = 0.86). Fleiss kappa values were poor for both accuracy (-0.03) and completeness (0.01).

Conclusion: All three LLMs provided highly accurate and complete responses to questions centered on immune checkpoint inhibitor ocular toxicities and management. Further studies are needed to assess specific immune checkpoint inhibitor agents and the accuracy and completeness of updated versions of LLMs.

Source
http://dx.doi.org/10.1097/IAE.0000000000004271