Objectives: To externally validate the performance of a commercial AI software program for interpreting CXRs in a large, consecutive, real-world cohort from primary healthcare centres.
Methods: A total of 3047 CXRs were collected from two primary healthcare centres, characterised by low disease prevalence, between January and December 2018. All CXRs were labelled as normal or abnormal according to CT findings. Four radiology residents read all CXRs twice with and without AI assistance. The performances of the AI and readers with and without AI assistance were measured in terms of area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity.
Results: The prevalence of clinically significant lesions was 2.2% (68 of 3047). The AUROC, sensitivity, and specificity of the AI were 0.648 (95% confidence interval [CI] 0.630-0.665), 35.3% (CI, 24.7-47.8), and 94.2% (CI, 93.3-95.0), respectively. AI detected 12 of 41 pneumonia, 3 of 5 tuberculosis, and 9 of 22 tumours. AI-undetected lesions tended to be smaller than true-positive lesions. The readers' AUROCs ranged from 0.534-0.676 without AI and 0.571-0.688 with AI (all p values < 0.05). For all readers, the mean reading time was 2.96-10.27 s longer with AI assistance (all p values < 0.05).
Conclusions: The performance of commercial AI in these high-volume, low-prevalence settings was poorer than expected, although it modestly boosted the performance of less-experienced readers. The technical prowess of AI demonstrated in experimental settings and approved by regulatory bodies may not directly translate to real-world practice, especially where the demand for AI assistance is highest.
Key Points: • This study shows the limited applicability of commercial AI software for detecting abnormalities in CXRs in a health screening population. • When using AI software in a specific clinical setting that differs from the training setting, it is necessary to adjust the threshold or perform additional training with such data that reflects this environment well. • Prospective test accuracy studies, randomised controlled trials, or cohort studies are needed to examine AI software to be implemented in real clinical practice.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1007/s00330-022-09315-z | DOI Listing |
Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!