Coverage bias in small molecule machine learning.

Nat Commun

Chair for Bioinformatics, Institute for Computer Science, Friedrich Schiller University Jena, Jena, Germany.

Published: January 2025

Small molecule machine learning aims to predict chemical, biochemical, or biological properties from molecular structures, with applications such as toxicity prediction, ligand binding, and pharmacokinetics. A recent trend is to develop end-to-end models that avoid explicit domain knowledge. These models assume that there is no coverage bias in the training and evaluation data, meaning that the data are representative of the true distribution. However, the domain of applicability is rarely considered in such models. Here, we investigate how well large-scale datasets cover the space of known biomolecular structures. To do so, we propose a distance measure based on solving the Maximum Common Edge Subgraph (MCES) problem, which aligns well with chemical similarity. Although computing this distance is computationally hard, we introduce an efficient approach that combines Integer Linear Programming and heuristic bounds. Our findings reveal that many widely used datasets lack uniform coverage of biomolecular structures, which limits the predictive power of models trained on them. We propose two additional methods to assess whether training datasets diverge from known molecular distributions, potentially guiding future dataset creation to improve model performance.
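To make the idea of an MCES-based distance with a cheap bound concrete, the following is a minimal sketch and not the authors' implementation. It assumes the distance is the number of edges left outside a maximum common edge subgraph, |E1| + |E2| - 2|MCES|, and it replaces the Integer Linear Programming step with brute-force enumeration that is only feasible for very small graphs. The graph encoding (a dict of atom labels plus an edge list without duplicates or self-loops), the function names, and the `threshold` parameter are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import permutations


def mces_size(labels1, edges1, labels2, edges2):
    """Exact MCES size by brute force (assumption: tiny graphs only)."""
    # Make graph 1 the smaller one so an injective node mapping exists.
    if len(labels1) > len(labels2):
        labels1, edges1, labels2, edges2 = labels2, edges2, labels1, edges1
    nodes1 = list(labels1)
    edge_lookup = {frozenset(e) for e in edges2}
    best = 0
    # Try every injective mapping of graph-1 nodes onto graph-2 nodes.
    for image in permutations(labels2, len(nodes1)):
        f = dict(zip(nodes1, image))
        kept = sum(
            1
            for u, v in edges1
            if labels1[u] == labels2[f[u]]
            and labels1[v] == labels2[f[v]]
            and frozenset((f[u], f[v])) in edge_lookup
        )
        best = max(best, kept)
    return best


def mces_distance(labels1, edges1, labels2, edges2):
    """Edges outside the common subgraph (assumed distance definition)."""
    common = mces_size(labels1, edges1, labels2, edges2)
    return len(edges1) + len(edges2) - 2 * common


def lower_bound(labels1, edges1, labels2, edges2):
    """Cheap bound: compare counts of edge types (pairs of atom labels)."""
    def counts(labels, edges):
        return Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)

    c1, c2 = counts(labels1, edges1), counts(labels2, edges2)
    return sum(abs(c1[t] - c2[t]) for t in c1.keys() | c2.keys())


def bounded_distance(labels1, edges1, labels2, edges2, threshold):
    """Return (value, is_exact); skip the exact search if the bound is large."""
    bound = lower_bound(labels1, edges1, labels2, edges2)
    if bound >= threshold:
        return bound, False  # only a lower bound, exact search skipped
    return mces_distance(labels1, edges1, labels2, edges2), True


# Heavy-atom graphs of three small molecules (hydrogens omitted).
ethanol = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
propanol = ({0: "C", 1: "C", 2: "C", 3: "O"}, [(0, 1), (1, 2), (2, 3)])
dimethyl_ether = ({0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)])

print(mces_distance(*ethanol, *propanol))
print(mces_distance(*ethanol, *dimethyl_ether))
print(bounded_distance(*ethanol, *propanol, threshold=10))
```

The edge-type bound is valid because a common subgraph can contain at most as many edges of each type as the graph with fewer of them. Checking such a bound first means the expensive exact computation runs only for pairs that might be close, which is the general role a bound plays in this kind of pipeline.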


Source
http://dx.doi.org/10.1038/s41467-024-55462-w

