Performance of a Breast Cancer Detection AI Algorithm Using the Personal Performance in Mammographic Screening Scheme.

Radiology

From the Department of Translational Medical Sciences, School of Medicine, University of Nottingham, Clinical Sciences Building, Nottingham City Hospital, City Hospital Campus, Hucknall Rd, Nottingham NG5 1PB, United Kingdom (Y.C., A.G.T., I.T.D.); and Nottingham Breast Institute, Nottingham University Hospitals NHS Trust, Nottingham, United Kingdom (J.J.J.).

Published: September 2023

Background: The Personal Performance in Mammographic Screening (PERFORMS) scheme is used to assess reader performance. Whether this scheme can assess the performance of artificial intelligence (AI) algorithms is unknown.

Purpose: To compare the performance of human readers and a commercially available AI algorithm interpreting PERFORMS test sets.

Materials and Methods: In this retrospective study, two PERFORMS test sets, each consisting of 60 challenging cases, were evaluated by human readers between May 2018 and March 2021 and by an AI algorithm in 2022. The AI algorithm considered each breast separately, assigning a suspicion-of-malignancy score to each detected feature; performance was assessed using the highest score per breast. Performance metrics, including sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), were calculated for AI and for the human readers. The study was powered to detect a medium-sized effect (odds ratio, 3.5 or 0.29) for sensitivity.

Results: A total of 552 human readers interpreted both PERFORMS test sets, which together consisted of 161 normal breasts, 70 malignant breasts, and nine benign breasts. No difference was observed at the breast level between the AUC for AI and the AUC for the human readers (0.93 vs 0.88, respectively; P = .15). When using the developer's suggested recall score threshold, no difference was observed between AI and human reader sensitivity (84% vs 90%; P = .34), but the specificity of AI (89%) was higher than that of the human readers (76%; P = .003). However, it was not possible to demonstrate equivalence due to the size of the test sets. When using recall thresholds chosen to match mean human reader performance (90% sensitivity, 76% specificity), AI showed no difference in performance, with a sensitivity of 91% (P = .73) and a specificity of 77% (P = .85).

Conclusion: Diagnostic performance of AI was comparable with that of the average human reader when evaluating cases from two enriched test sets from the PERFORMS scheme.

© RSNA, 2023. See also the editorial by Philpotts in this issue.
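Because the abstract reports both threshold-dependent metrics (sensitivity, specificity at a recall threshold) and the threshold-independent AUC, the following minimal Python sketch illustrates how such breast-level metrics can be computed. This is not the authors' analysis code: the scores and labels are simulated placeholders, the 0.5 "developer-suggested" threshold is hypothetical, and only the 90% target sensitivity is taken from the abstract.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# Hypothetical per-breast data: 1 = malignant, 0 = normal/benign.
# In the study, the AI scored each detected feature and the highest
# score per breast was used; here we simulate one score per breast.
is_malignant = rng.integers(0, 2, size=240)
ai_scores = np.clip(is_malignant * 0.4 + rng.normal(0.4, 0.2, size=240), 0, 1)

# Breast-level AUC (threshold-independent).
auc = roc_auc_score(is_malignant, ai_scores)

def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when recalling breasts scoring >= threshold."""
    recalled = scores >= threshold
    sensitivity = recalled[labels == 1].mean()
    specificity = (~recalled)[labels == 0].mean()
    return sensitivity, specificity

# Metrics at a fixed recall threshold (0.5 is an illustrative value,
# standing in for a developer-suggested operating point).
sens, spec = sens_spec(ai_scores, is_malignant, threshold=0.5)

# Alternatively, choose the threshold whose sensitivity best matches a
# target, e.g. the 90% mean human reader sensitivity from the abstract.
fpr, tpr, thresholds = roc_curve(is_malignant, ai_scores)
matched_threshold = thresholds[np.argmin(np.abs(tpr - 0.90))]
sens_m, spec_m = sens_spec(ai_scores, is_malignant, matched_threshold)

print(f"AUC={auc:.2f}  fixed: sens={sens:.0%} spec={spec:.0%}  "
      f"matched: sens={sens_m:.0%} spec={spec_m:.0%}")

Matching the operating point to mean human sensitivity, as in the second step above, is what allows the specificity comparison in the abstract to be made at equal sensitivity.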


Source: http://dx.doi.org/10.1148/radiol.223299

