In silico methods for prediction of chemical toxicity can decrease the cost and increase the efficiency in the early stage of drug discovery. However, due to low accessibility of sufficient and reliable toxicity data, constructing robust and accurate prediction models is challenging. Contrastive learning, a type of self-supervised learning, leverages large unlabeled data to obtain more expressive molecular representations, which can boost the prediction performance on downstream tasks. While molecular graph contrastive learning has gathered growing attentions, current models neglect the quality of negative data set. Here, we proposed a self-supervised pretraining deep learning framework named GCLmf. We first utilized molecular fragments that meet specific conditions as hard negative samples to boost the quality of the negative set and thus increase the difficulty of the proxy tasks during pre-training to learn informative representations. GCLmf has shown excellent predictive power on various molecular property benchmarks and demonstrates high performance in 33 toxicity tasks in comparison with multiple baselines. In addition, we further investigated the necessity of introducing hard negatives in model building and the impact of the proportion of hard negatives on the model.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1002/minf.202400169 | DOI Listing |
Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!