A decision-theoretic approach to the evaluation of machine learning algorithms in computational drug discovery.

Oliver P Watson, Isidro Cortes-Ciriano, Aimee R Taylor, James A Watson

Bioinformatics

Nuffield Department of Medicine, Centre for Tropical Medicine and Global Health, University of Oxford, Oxford OX3 7LF, UK.

Published: November 2019

Motivation: Artificial intelligence, trained via machine learning (e.g. neural nets, random forests) or computational statistical algorithms (e.g. support vector machines, ridge regression), holds much promise for the improvement of small-molecule drug discovery. However, small-molecule structure-activity data are high dimensional with low signal-to-noise ratios, and proper validation of predictive methods is difficult. It is poorly understood which, if any, of the currently available machine learning algorithms will best predict new candidate drugs.

Results: The quantile-activity bootstrap is proposed as a new model validation framework using quantile splits on the activity distribution function to construct training and testing sets. In addition, we propose two novel rank-based loss functions which penalize only the out-of-sample predicted ranks of high-activity molecules. The combination of these methods was used to assess the performance of neural nets, random forests, support vector machines (regression) and ridge regression applied to 25 diverse high-quality structure-activity datasets publicly available on ChEMBL. Model validation based on random partitioning of available data favours models that overfit and 'memorize' the training set, namely random forests and deep neural nets. Partitioning based on quantiles of the activity distribution correctly penalizes extrapolation of models onto structurally different molecules outside of the training data. Simpler, traditional statistical methods such as ridge regression can outperform state-of-the-art machine learning methods in this setting. In addition, our new rank-based loss functions give considerably different results from mean squared error, highlighting the necessity to define model optimality with respect to the decision task at hand.
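The two ideas in the Results paragraph can be illustrated with a minimal sketch. It is not the authors' implementation (their code is in the notebooks linked below). It assumes that the training set is the part of the data at or below an activity quantile and the test set is the part above it, and it uses one hypothetical rank-based loss that looks only at where the truly most active molecules land in the predicted ordering. The function names `quantile_split` and `top_k_rank_loss` and the parameters `q` and `k` are illustrative choices, not names from the paper.

```python
import numpy as np


def quantile_split(activity, q=0.75):
    """Split indices at the q-th activity quantile (assumed convention:
    train on activities <= threshold, test on activities > threshold)."""
    activity = np.asarray(activity, dtype=float)
    threshold = np.quantile(activity, q)
    train_idx = np.flatnonzero(activity <= threshold)
    test_idx = np.flatnonzero(activity > threshold)
    return train_idx, test_idx


def top_k_rank_loss(y_true, y_pred, k=10):
    """Hypothetical rank-based loss in [0, 1].

    Takes the k molecules with the highest true activity and averages
    their predicted ranks (rank 0 = highest predicted activity).
    Returns 0 when those k molecules are predicted on top and 1 when
    they are predicted at the very bottom.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    k = min(k, n)

    # Predicted rank of every molecule, 0 = highest predicted activity.
    order = np.argsort(-y_pred, kind="stable")
    pred_rank = np.empty(n, dtype=int)
    pred_rank[order] = np.arange(n)

    # Indices of the k truly most active molecules.
    top_true = np.argsort(-y_true, kind="stable")[:k]

    # Best achievable mean rank is (k - 1) / 2; worst excess is n - k.
    excess = pred_rank[top_true].mean() - (k - 1) / 2
    worst = n - k
    return float(excess / worst) if worst > 0 else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    activity = rng.normal(size=200)          # synthetic activities
    train_idx, test_idx = quantile_split(activity, q=0.75)
    noisy_pred = activity + rng.normal(scale=0.5, size=200)
    print(len(train_idx), len(test_idx))
    print(top_k_rank_loss(activity, noisy_pred, k=10))
```

Unlike mean squared error, a loss of this kind ignores how well low-activity molecules are predicted, which is one way two loss functions can rank the same models differently.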

Availability And Implementation: All software and data are available as Jupyter notebooks found at https://github.com/owatson/QuantileBootstrap.

Supplementary Information: Supplementary data are available at Bioinformatics online.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6853675
DOI: http://dx.doi.org/10.1093/bioinformatics/btz293

Similar Publications

Biologically-targeted discovery-replication scan identifies G×G interaction in relation to risk of Barrett's esophagus and esophageal adenocarcinoma.

HGG Adv

January 2025

Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, USA.

Li Yan, Qianchuan He, Shiv P Verma, Xu Zhang, Ann-Sophie Giel

Inherited genetics represents an important contributor to risk of esophageal adenocarcinoma (EAC), and its precursor Barrett's esophagus (BE). Genome-wide association studies have identified ∼30 susceptibility variants for BE/EAC, yet genetic interactions remain unexamined. To address challenges in large-scale G×G scans, we combined knowledge-guided filtering and machine learning approaches, focusing on genes with (A) known/plausible links to BE/EAC pathogenesis (n=493) or (B) prior evidence of biological interactions (n=4,196).


Cognitive load detection through EEG lead wise feature optimization and ensemble classification.

Sci Rep

January 2025

Department of ECE, Kallam Haranadhareddy Institute of Technology, Guntur, Andhra Pradesh, India.

Jammisetty Yedukondalu, Kalyani Sunkara, Vankayalapati Radhika, Sivakrishna Kondaveeti, Murali Anumothu

Cognitive load stimulates neural activity, essential for understanding the brain's response to stress-inducing stimuli or mental strain. This study examines the feasibility of evaluating cognitive load by extracting, selecting, and classifying features from electroencephalogram (EEG) signals. We employed robust local mean decomposition (R-LMD) to decompose EEG data from each channel, recorded over a four-second period, into five modes.


Machine learning assisted classification RASAR modeling for the nephrotoxicity potential of a curated set of orally active drugs.

Sci Rep

January 2025

Drug Theoretics and Cheminformatics Laboratory, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, 700 032, India.

Arkaprava Banerjee, Kunal Roy

We have adopted the classification Read-Across Structure-Activity Relationship (c-RASAR) approach in the present study for machine-learning (ML)-based model development from a recently reported curated dataset of nephrotoxicity potential of orally active drugs. We initially developed ML models using nine different algorithms separately on topological descriptors (referred to as simply "descriptors" in the subsequent sections of the manuscript) and MACCS fingerprints (referred to as "fingerprints" in the subsequent sections of the manuscript), thus generating 18 different ML QSAR models. Using the chemical spaces defined by the modeling descriptors and fingerprints, the similarity and error-based RASAR descriptors were computed, and the most discriminating RASAR descriptors were used to develop another set of 18 different ML c-RASAR models.


Machine learning techniques for non-destructive estimation of plum fruit weight.

Sci Rep

January 2025

Crop and Horticultural Science Research Department, Mazandaran Agricultural Resources Research and Education Center, Agricultural Research, Education and Extension Organization (AREEO), Tajrish, Iran.

Atefeh Sabouri, Adel Bakhshipour, Mehrnaz Poorsalehi, Abouzar Abouzari

Plum fruit fresh weight (FW) estimation is crucial for various agricultural practices, including yield prediction, quality control, and market pricing. Traditional methods for estimating fruit weight are often destructive, time-consuming, and labor-intensive. In this study, we addressed the problem of predicting plum FW using artificial intelligence (AI) methods based on fruit dimensions.


Prognostic implications and therapeutic opportunities related to CAF subtypes in CMS4 colorectal cancer: insights from single-cell and bulk transcriptomics.

Apoptosis

January 2025

Department of Pathology, Fudan University Shanghai Cancer Center, Shanghai, China.

Mengke Ma, Jin Chu, Changhua Zhuo, Xin Xiong, Wenchao Gu

Cancer-associated fibroblasts (CAFs) significantly influence tumor progression and therapeutic resistance in colorectal cancer (CRC). However, the distributions and functions of CAF subpopulations vary across the four consensus molecular subtypes (CMSs) of CRC. This study performed single-cell RNA and bulk RNA sequencing and revealed that myofibroblast-like CAFs (myCAFs), tumor-like CAFs (tCAFs), inflammatory CAFs (iCAFs), CXCL14CAFs, and MTCAFs are notably enriched in CMS4 compared with other CMSs of CRC.
