Background: Machine learning (ML) is a powerful tool for capturing relationships between molecular alterations and cancer types and for extracting biological information. Here, we developed a plain ML model aimed at distinguishing cancer types based on genetic lesions, providing an additional tool to improve cancer diagnosis, particularly for tumors of unknown origin.
Methods: TCGA data from 9,927 samples spanning 32 different cancer types were downloaded from cBioPortal. A vector-space-model-type data transformation was designed to build consistently homogeneous datasets whose predictive features are calls for somatic point mutations and chromosome-arm-level copy number variations, thus allowing the use of XGBoost classifier models. Because the dataset is imbalanced, owing to large differences in the number of cases per tumor type, two preprocessing strategies were considered: i) setting a percentage cut-off threshold to remove the less represented cancer types, and ii) dividing cancer types into groups based on biological criteria and training a specific XGBoost model for each group. The performance of all trained models was assessed mainly by the out-of-sample balanced accuracy (BACC) and AUC scores.
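As an illustration of the kind of transformation and class filtering described in the Methods, the following is a minimal sketch, not the authors' implementation. It assumes each sample is given as a set of alteration labels; the labels (`TP53_mut`, `8q_gain`), the sample IDs, the function names and the 30% cut-off are all hypothetical. Each sample becomes a fixed-length 0/1 vector over a shared vocabulary, and cancer types below a percentage threshold are dropped.

```python
from collections import Counter

def build_vocabulary(samples):
    """Collect every alteration label seen in any sample, in sorted order."""
    return sorted({alt for alts in samples.values() for alt in alts})

def vectorize(samples, vocabulary):
    """Map each sample to a 0/1 vector with one position per vocabulary entry."""
    index = {alt: i for i, alt in enumerate(vocabulary)}
    matrix = {}
    for sample_id, alts in samples.items():
        row = [0] * len(vocabulary)
        for alt in alts:
            row[index[alt]] = 1
        matrix[sample_id] = row
    return matrix

def filter_rare_classes(labels, min_fraction):
    """Keep samples whose cancer type is at least min_fraction of the cohort."""
    counts = Counter(labels.values())
    total = len(labels)
    kept_types = {t for t, n in counts.items() if n / total >= min_fraction}
    return {s: t for s, t in labels.items() if t in kept_types}

# Toy cohort (hypothetical labels, for illustration only).
samples = {
    "S1": {"TP53_mut", "8q_gain"},
    "S2": {"KRAS_mut", "TP53_mut"},
    "S3": {"PIK3CA_mut", "1q_gain"},
    "S4": {"BRAF_mut"},
}
labels = {"S1": "BRCA", "S2": "LUAD", "S3": "BRCA", "S4": "THCA"}

vocabulary = build_vocabulary(samples)
matrix = vectorize(samples, vocabulary)
kept_labels = filter_rare_classes(labels, min_fraction=0.30)
```

Rows of `matrix` restricted to the keys of `kept_labels` would then form the homogeneous input table that a gradient-boosted classifier can be trained on.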
Results: The XGBoost classifier achieved the best performance (BACC 77%; AUC 97%) on a dataset containing the 10 most represented tumor types. Moreover, when the 18 most represented cancers were divided into three groups (endocrine-related carcinomas, other carcinomas and other cancers), the group-specific models achieved 78%, 71% and 86% BACC, respectively, with AUC scores greater than 96%. In addition, the model capable of linking each group to a specific cancer type reached 81% BACC and 94% AUC. Overall, the diagnostic potential of our model was comparable to or higher than that of others already described in the literature and based on similar molecular data and ML approaches.
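To make the BACC figures easier to interpret, here is a small sketch that assumes the standard multiclass definition of balanced accuracy (the unweighted mean of per-class recall). The labels and predictions are invented toy data, not results from the study, and the function names are illustrative.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

def plain_accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly, ignoring class sizes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Imbalanced toy cohort: a classifier that always predicts the majority class.
y_true = ["BRCA"] * 8 + ["THCA"] * 2
y_pred = ["BRCA"] * 10

acc = plain_accuracy(y_true, y_pred)      # 8 of 10 correct
bacc = balanced_accuracy(y_true, y_pred)  # recall 1.0 for BRCA, 0.0 for THCA
```

In this toy case plain accuracy is 0.8 while balanced accuracy is 0.5, which is why a class-balanced metric is the more informative choice when tumor types differ widely in sample count.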
Conclusions: A boosted ML approach able to accurately discriminate different cancer types was developed. The methodology builds datasets that are simpler and more interpretable than the original data, while keeping enough information to accurately train standard ML models without resorting to sophisticated Deep Learning architectures. In combination with histopathological examinations, this approach could improve cancer diagnosis by using specific DNA alterations, processed by a replicable and easy-to-use automated technology. The study encourages further investigations that could increase the classifier's performance, for example by considering more features and dividing tumors into their main molecular subtypes.
Link | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10664515 | PMC |
http://dx.doi.org/10.1186/s12967-023-04720-4 | DOI Listing |
Sci Rep
January 2025
Gynecology Department, Affiliated Hospital of Nanjing University of Chinese Medicine, Nanjing, 210029, China.
The presence of high-risk human papillomavirus (HR-HPV) contributes to the development of cervical lesions and cervical cancer. Recent studies suggest that an imbalance in the cervicovaginal microbiota might be a factor in the persistence of HR-HPV infections. In this study, we collected 156 cervicovaginal fluid (CVF) samples from women with HR-HPV infection, who were divided into three groups (negative for intraepithelial lesions, n = 78; low-grade squamous intraepithelial lesions, n = 52; high-grade squamous intraepithelial lesions, n = 26).
Discov Oncol
January 2025
Department of Bioscience and Biotechnology, Banasthali Vidyapith, Niwai-Tonk, Rajasthan, 304022, India.
The prominence of circular RNAs (circRNAs) has surged in cancer research due to their distinctive properties and impact on cancer development. This review delves into the role of circRNAs in four key cancer types: colorectal cancer (CRC), gastric cancer (GC), liver cancer (HCC), and lung cancer (LUAD). The focus lies on their potential as cancer biomarkers and drug targets.
Sci Rep
January 2025
Institute for System Dynamics, University of Stuttgart, Waldburgstr. 19, 70563, Stuttgart, Germany.
Including sensor information in medical interventions aims to help surgeons decide on subsequent action steps by characterizing tissue intraoperatively. With bladder cancer, an important issue is tumor recurrence because of failure to remove the entire tumor. Impedance measurements can help to classify bladder tissue and give surgeons an indication of how much tissue to remove.
Sci Rep
January 2025
Research Institute for Applied Microelectronics (IUMA), University of Las Palmas de Gran Canaria (ULPGC), Las Palmas de Gran Canaria, Spain.
Cervical cancer remains a major global health concern, with an especially alarming incidence in younger women. Traditional detection techniques such as the Pap smear and colposcopy often lack sensitivity and specificity and are highly dependent on the experience of the gynaecologist. In response, this study proposes the use of Hyperspectral Imaging, a pioneering technology that combines traditional imaging with spectroscopy to provide detailed spatial and spectral information.
Nat Commun
January 2025
Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, ON, Canada.
Spatial protein expression technologies can map cellular content and organization by simultaneously quantifying the expression of >40 proteins at subcellular resolution within intact tissue sections and cell lines. However, necessary image segmentation to single cells is challenging and error prone, easily confounding the interpretation of cellular phenotypes and cell clusters. To address these limitations, we present STARLING, a probabilistic machine learning model designed to quantify cell populations from spatial protein expression data while accounting for segmentation errors.