Traditional best practices for quantitative structure-activity relationship (QSAR) modeling recommend dataset balancing and treat balanced accuracy (BA) as the key objective of model development. This study examines the value of these conventional norms when QSAR models are used for virtual screening of modern large and ultra-large chemical libraries. For this increasingly common task, we now recommend models with the highest positive predictive value (PPV), built on imbalanced training sets, as the preferred virtual screening tools. This recommendation stems from practical considerations of how virtual screening results are used in experimental laboratories, where only a small fraction of virtually screened molecules can be tested using standard well plates. As a proof of concept, we developed QSAR models for five expansive datasets with different ratios of active and inactive molecules and compared model performance in virtual screening using BA, PPV, and other metrics. We show that training on imbalanced datasets achieves a hit rate at least 30% higher than training on balanced datasets, and that the PPV metric captured this difference in performance with no parameter tuning. Importantly, hit rates were estimated for top-scoring compounds organized in batches the size of the plates (for instance, 128 molecules) used in experimental high-throughput screening. Based on these results, we posit that QSAR models trained on imbalanced datasets with the highest PPV should be relied upon to identify and test hit compounds in early drug discovery studies.
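To make the two metrics concrete, the following is a minimal sketch, not the authors' implementation, of how BA over a whole test set and PPV over a single plate-sized batch of top-scoring compounds might be computed. The function names `balanced_accuracy` and `ppv_top_n`, and the use of NumPy, are assumptions made for illustration; the default batch size of 128 is taken from the example in the abstract.

```python
import numpy as np


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity over the whole set.

    Assumes binary labels (1 = active, 0 = inactive) and that both
    classes are present; otherwise a division by zero occurs.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return float(0.5 * (sensitivity + specificity))


def ppv_top_n(y_true, scores, n=128):
    """Fraction of true actives among the n highest-scoring compounds.

    This treats the top-n batch as the set of predicted positives, so
    the value is the hit rate for one plate of n compounds
    (hypothetical helper; n=128 follows the abstract's example).
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    # Stable sort on negated scores: highest score first, ties keep input order.
    top = np.argsort(-scores, kind="stable")[:n]
    return float(np.mean(y_true[top]))
```

Under these assumptions, `ppv_top_n(labels, model_scores, n=128)` estimates the hit rate a laboratory would observe if it tested only the first plate of nominated compounds, which is the quantity the abstract argues matters most for virtual screening.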
DOI: http://dx.doi.org/10.1186/s13321-025-00948-y