AI Article Synopsis

  • Experimental errors in protein-ligand interaction measurements can negatively impact predictive models, yet they are often overlooked during model development.
  • A new approach using a Probabilistic Random Forest (PRF) classifier incorporates these experimental uncertainties into target prediction, evaluated across ~550 tasks drawn from ChEMBL and PubChem.
  • The PRF outperformed the standard Random Forest (RF) for uncertain data points near the decision threshold (a median absolute error margin of ~17%), whereas RF performed better for high-confidence active classifications. RF errors there were smaller than the experimental uncertainty, which may indicate overtraining or overconfidence.

Article Abstract

Measurements of protein-ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently be affected by these unavoidable errors, which should ideally be factored into modelling and output predictions, for example through the actual standard deviation of experimental measurements (σ) or the comparability of activity values between the aggregated heterogeneous activity units (i.e., Ki versus IC50 values) during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. To improve upon the current state of the art, we present a novel approach toward predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied to in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated under various scenarios of experimental standard deviation in both training and test sets, and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, information that the original RF algorithm does not consider in any way. For example, when σ ranged between 0.4 and 0.6 log units and the ideal probability estimates lay between 0.4 and 0.6, the PRF outperformed RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed PRF for cases with high confidence of belonging to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident. Finally, PRF models trained with putative inactives performed worse than PRF models trained without them. This could be because putative inactives were not assigned an experimental pXC50 value and were therefore treated as inactives with low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold.
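
The abstract does not spell out how an "ideal probability estimate" is derived from an activity value and its standard deviation σ. One plausible reading, sketched below, treats the measurement error as Gaussian and takes the probability of activity to be the normal CDF evaluated at the distance from the threshold in units of σ. The function names, the Gaussian assumption, and the threshold of 6.0 log units in the usage note are illustrative assumptions, not details taken from the article.

```python
import math
from statistics import median


def ideal_probability(pxc50, threshold, sigma):
    """Probability that the true activity lies at or above the threshold.

    Assumption (not from the abstract): the measurement error is Gaussian
    with standard deviation `sigma`, in the same log units as `pxc50`.
    """
    if sigma <= 0:
        # No uncertainty: fall back to the usual hard binary label.
        return 1.0 if pxc50 >= threshold else 0.0
    z = (pxc50 - threshold) / (sigma * math.sqrt(2.0))
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z))


def median_absolute_error(predicted, ideal):
    """Median of |predicted - ideal| over paired probability estimates."""
    return median(abs(p - q) for p, q in zip(predicted, ideal))
```

With a hypothetical threshold of 6.0 and σ = 0.5, a compound measured exactly at the threshold gets a probability of 0.5 and one measured 0.5 log units above it gets about 0.84, whereas a hard label would assign 1 to both. This is the region near the threshold where the abstract reports the largest benefit for PRF.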


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8375213
DOI: http://dx.doi.org/10.1186/s13321-021-00539-7

Publication Analysis

Top Keywords

  • putative inactives: 12
  • probabilistic random: 8
  • random forest: 8
  • close classification: 8
  • classification threshold: 8
  • experimental: 8
  • experimental uncertainty: 8
  • protein-ligand interactions: 8
  • experimental errors: 8
  • prf: 8

