Improved GNNs for Log D7.4 Prediction by Transferring Knowledge from Low-Fidelity Data.

Yan-Jing Duan, Li Fu, Xiao-Chen Zhang, Teng-Zhi Long, Yuan-Hang He, Zhao-Qian Liu, Ai-Ping Lu, Ya-Feng Deng, Chang-Yu Hsieh, Ting-Jun Hou, Dong-Sheng Cao

J Chem Inf Model

Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P. R. China.

Published: April 2023

The n-octanol/buffer solution distribution coefficient at pH = 7.4 (log D7.4) is an indicator of lipophilicity, and it influences a wide variety of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties and the druggability of compounds. In log D7.4 prediction, graph neural networks (GNNs) can uncover subtle structure-property relationships (SPRs) by automatically extracting features from molecular graphs, but their performance is often limited by the small size of available datasets. Herein, we present a transfer learning strategy called pretraining on computational data and then fine-tuning on experimental data (PCFE) to fully exploit the predictive potential of GNNs. PCFE works by pretraining a GNN model on 1.71 million computational log D7.4 data points (low-fidelity data) and then fine-tuning it on 19,155 experimental log D7.4 data points (high-fidelity data). Experiments with three GNN architectures (graph convolutional network (GCN), graph attention network (GAT), and Attentive FP) demonstrated the effectiveness of PCFE in improving GNNs for log D7.4 prediction. Moreover, the optimal PCFE-trained GNN model (cx-Attentive FP, R² = 0.909) outperformed four descriptor-based models (random forest (RF), gradient boosting (GB), support vector machine (SVM), and extreme gradient boosting (XGBoost)). The robustness of the cx-Attentive FP model was also confirmed by evaluating it with different training data sizes and dataset splitting strategies. We therefore developed a webserver and defined the applicability domain for this model. The webserver (http://tools.scbdd.com/chemlogd/) provides free log D7.4 prediction services. In addition, the important descriptors for log D7.4 were detected by the Shapley additive explanations (SHAP) method, and the substructures most relevant to log D7.4 were identified by the attention mechanism.
Finally, matched molecular pair analysis (MMPA) was performed to summarize the contributions of common chemical substituents to log D7.4, including a variety of hydrocarbon groups, halogen groups, heteroatoms, and polar groups. In conclusion, we believe that the cx-Attentive FP model can serve as a reliable tool to predict log D7.4 and hope that pretraining on low-fidelity data can help GNNs make accurate predictions of other endpoints in drug discovery.
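
The pretrain-then-fine-tune idea behind PCFE can be illustrated with a toy sketch. This is not the authors' implementation: a one-variable linear regressor stands in for the GNN, and the data are synthetic. Every name, constant, and hyperparameter below (`true_value`, `make_data`, the bias and noise levels, learning rates, epoch counts) is a hypothetical choice made only for this illustration. The sketch pretrains on plentiful, biased, noisy "low-fidelity" labels, fine-tunes briefly on a handful of accurate "high-fidelity" labels, and compares the result with training from scratch on the small set alone.

```python
import random

random.seed(0)  # fixed seed so the toy run is reproducible


def true_value(x):
    # Hypothetical "experimental" relationship, chosen only for this example.
    return 2.0 * x + 1.0


def make_data(n, bias, noise):
    """Sample n (x, y) pairs; bias and noise model the fidelity gap."""
    data = []
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)
        data.append((x, true_value(x) + bias + random.gauss(0.0, noise)))
    return data


def fit(data, w, b, lr, epochs):
    """Full-batch gradient descent on mean squared error for y = w*x + b."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def mse(data, w, b):
    """Mean squared error of the linear model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Synthetic stand-ins for the two fidelity levels (sizes are arbitrary).
low_fidelity = make_data(2000, bias=0.3, noise=0.3)   # plentiful, biased, noisy
high_fidelity = make_data(8, bias=0.0, noise=0.05)    # scarce, accurate
test_set = make_data(200, bias=0.0, noise=0.05)       # held-out accurate data

# Step 1: pretrain on the large low-fidelity set.
pre_w, pre_b = fit(low_fidelity, 0.0, 0.0, lr=0.1, epochs=200)

# Step 2: fine-tune briefly on the small high-fidelity set,
# starting from the pretrained parameters.
ft_w, ft_b = fit(high_fidelity, pre_w, pre_b, lr=0.05, epochs=20)

# Baseline: the same short training budget, but from scratch.
sc_w, sc_b = fit(high_fidelity, 0.0, 0.0, lr=0.05, epochs=20)

ft_mse = mse(test_set, ft_w, ft_b)
scratch_mse = mse(test_set, sc_w, sc_b)
print(f"fine-tuned test MSE:   {ft_mse:.4f}")
print(f"from-scratch test MSE: {scratch_mse:.4f}")
```

In this toy setting the pretrained parameters already sit close to the target relationship, so a short fine-tuning run mainly has to correct the systematic offset of the low-fidelity labels. The analogy to a GNN is loose and is meant only to show the two-stage training order.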

DOI: http://dx.doi.org/10.1021/acs.jcim.2c01564
