A hitchhiker's guide to deep chemical language processing for bioactivity prediction.

Digit Discov

Eindhoven University of Technology, Institute for Complex Molecular Systems, Eindhoven AI Systems Institute, Dept. Biomedical Engineering, Eindhoven, Netherlands

Published: December 2024

Deep learning has significantly accelerated drug discovery, with 'chemical language' processing (CLP) emerging as a prominent approach. CLP approaches learn from molecular string representations (e.g., Simplified Molecular Input Line Entry System [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite their growing importance, training predictive CLP models is far from trivial, as it involves many 'bells and whistles'. Here, we analyze the key elements of CLP and provide guidelines for newcomers and experts. Our study spans three neural network architectures, two string representations, and three embedding strategies across ten bioactivity datasets, for both classification and regression purposes. This 'hitchhiker's guide' not only underscores the importance of certain methodological decisions, but also equips researchers with practical recommendations on ideal choices, e.g., in terms of neural network architectures, molecular representations, and hyperparameter optimization.
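
As a rough illustration of what treating a molecular string as 'language' can look like in practice, the sketch below splits a SMILES string into tokens and maps them to padded integer indices, the form a neural network embedding layer would typically consume. The regular expression, the function names (`tokenize`, `build_vocab`, `encode`), and the `<pad>` token are assumptions made for this example; they are not the preprocessing used in the paper.

```python
import re

# Assumed tokenization pattern (a common convention, not taken from the paper):
# bracket atoms first, then two-letter halogens, then single-character tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@|%\d{2}|[A-Za-z]|\d|[=#\-\+\(\)/\\\.@:~\*\$])"
)


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so verify nothing was lost.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(token_lists):
    """Assign an integer index to every token; index 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to indices, truncating or right-padding to max_len."""
    ids = [vocab[token] for token in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    molecules = ["CC(=O)Oc1ccccc1C(=O)O", "C[C@@H](N)C(=O)O", "ClCBr"]
    token_lists = [tokenize(s) for s in molecules]
    vocab = build_vocab(token_lists)
    for smiles, tokens in zip(molecules, token_lists):
        print(smiles, encode(tokens, vocab, max_len=24))
```

Keeping bracket atoms such as `[C@@H]` and halogens such as `Cl` as single tokens is one design choice among several; character-level splitting is simpler but breaks these chemically meaningful units apart.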

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11667676
DOI: http://dx.doi.org/10.1039/d4dd00311j
