Background: Previous studies of artificial intelligence (AI) applied to dermatology have shown AI to have higher diagnostic classification accuracy than expert dermatologists; however, these studies did not adequately assess clinically realistic scenarios, such as how AI systems behave when presented with images of disease categories that are not included in the training dataset or images drawn from statistical distributions with significant shifts from training distributions. We aimed to simulate these real-world scenarios and evaluate the effects of image source institution, diagnoses outside of the training set, and other image artifacts on classification accuracy, with the goal of informing clinicians and regulatory agencies about safety and real-world accuracy.

Methods: We designed a large dermoscopic image classification challenge to quantify the performance of machine learning algorithms for the task of skin cancer classification from dermoscopic images, and how this performance is affected by shifts in statistical distributions of data, disease categories not represented in training datasets, and imaging or lesion artifacts. Factors that might be beneficial to performance, such as clinical metadata and external training data collected by challenge participants, were also evaluated. 25 331 training images collected from two datasets (in Vienna [HAM10000] and Barcelona [BCN20000]) between Jan 1, 2000, and Dec 31, 2018, across eight skin diseases, were provided to challenge participants to design appropriate algorithms. The trained algorithms were then tested for balanced accuracy against the HAM10000 and BCN20000 test datasets and data from countries not included in the training dataset (Turkey, New Zealand, Sweden, and Argentina). Test datasets contained images of all diagnostic categories available in training plus other diagnoses not included in training data (not trained category). We compared the performance of the algorithms against that of 18 dermatologists in a simulated setting that reflected intended clinical use.
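The abstract does not spell out how balanced accuracy is computed. As a minimal sketch, assuming the common definition (the unweighted mean of per-class recall), the metric could be calculated as below. The function name and the toy labels are illustrative and are not taken from the challenge code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true.

    Assumption: this is the usual definition of balanced accuracy; the
    abstract itself does not give a formula.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of images per true class
    correct = defaultdict(int)  # correctly classified images per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example with hypothetical labels: 8 nevi, 2 melanomas.
y_true = ["nevus"] * 8 + ["melanoma"] * 2
y_pred = ["nevus"] * 8 + ["nevus", "melanoma"]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(f"plain accuracy: {plain:.2f}, balanced accuracy: {balanced:.2f}")
```

In this toy case the plain accuracy is high because the majority class dominates, whereas the balanced figure is pulled down by the missed melanoma. This is why a class-balanced metric suits datasets in which some diagnoses are rare.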

Findings: 64 teams submitted 129 state-of-the-art algorithm predictions on a test set of 8238 images. The best performing algorithm achieved 58·8% balanced accuracy on the BCN20000 data, which was designed to better reflect realistic clinical scenarios, compared with 82·0% balanced accuracy on HAM10000, which was used in a previously published benchmark. Shifted statistical distributions and disease categories not included in training data contributed to decreases in accuracy. Image artifacts, including hair, pen markings, ulceration, and imaging source institution, decreased accuracy in a complex manner that varied based on the underlying diagnosis. When comparing algorithms to expert dermatologists (2460 ratings on 1269 images), algorithms performed better than experts in most categories, except for actinic keratoses (similar accuracy on average) and images from categories not included in training data (26% correct for experts vs 6% correct for algorithms, p<0·0001). For the top 25 submitted algorithms, 47·1% of the images from categories not included in training data were misclassified as malignant diagnoses, which would lead to a substantial number of unnecessary biopsies if current state-of-the-art AI technologies were clinically deployed.
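The proportion of images from untrained categories that were labelled malignant can be illustrated with a short sketch. The label names and the choice of which labels count as malignant are assumptions made for illustration only; the abstract does not list them, and this is not the authors' analysis code.

```python
def malignant_share(predictions, malignant_labels):
    """Fraction of predictions that fall in the set of malignant labels.

    `predictions` holds the predicted labels for images whose true
    diagnosis was NOT among the training categories.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    hits = sum(1 for label in predictions if label in malignant_labels)
    return hits / len(predictions)


# Hypothetical labels, chosen for illustration only.
MALIGNANT = {"melanoma", "basal_cell_carcinoma", "squamous_cell_carcinoma"}

# Hypothetical predictions for four images from untrained categories.
preds_on_untrained = [
    "melanoma",
    "nevus",
    "basal_cell_carcinoma",
    "benign_keratosis",
]

share = malignant_share(preds_on_untrained, MALIGNANT)
print(f"{share:.0%} of untrained-category images were labelled malignant")
```

A classifier with no option to answer "none of the above" must assign every such image to a trained category, so any share above zero translates into potential unnecessary biopsies.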

Interpretation: We have identified specific deficiencies and safety issues in AI diagnostic systems for skin cancer that should be addressed in future diagnostic evaluation protocols to improve safety and reliability in clinical practice.

Funding: Melanoma Research Alliance and La Marató de TV3.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9295694
DOI: http://dx.doi.org/10.1016/S2589-7500(22)00021-8
