Introduction: The identification of metabolomic biomarkers predictive of cancer patient response to therapy and of disease stage has been pursued as a "holy grail" of modern oncology, relying on the metabolic dysfunction that characterizes cancer progression. In spite of the evaluation of many candidate biomarkers, however, determination of a consistent set with practical clinical utility has proven elusive.
Objective: In this study, we systematically examine the combined role of data pre-treatment and imputation methods on the performance of multivariate data analysis methods and their identification of potential biomarkers.
Outcomes for cancer patients have been previously estimated by applying various machine learning techniques to large datasets such as the Surveillance, Epidemiology, and End Results (SEER) program database. In particular for lung cancer, it is not well understood which types of techniques would yield more predictive information, and which data attributes should be used in order to determine this information. In this study, a number of supervised learning techniques is applied to the SEER database to classify lung cancer patients in terms of survival, including linear regression, Decision Trees, Gradient Boosting Machines (GBM), Support Vector Machines (SVM), and a custom ensemble.
View Article and Find Full Text PDFThis study applies unsupervised machine learning techniques for classification and clustering to a collection of descriptive variables from 10,442 lung cancer patient records in the Surveillance, Epidemiology, and End Results (SEER) program database. The goal is to automatically classify lung cancer patients into groups based on clinically measurable disease-specific variables in order to estimate survival. Variables selected as inputs for machine learning include Number of Primaries, Age, Grade, Tumor Size, Stage, and TNM, which are numeric or can readily be converted to numeric type.
View Article and Find Full Text PDF