Machine learning models have found numerous successful applications in computational drug discovery. Many of these models represent molecules as sequences, because molecular sequences are readily available, simple, and informative. Sequence-based models often segment molecular sequences into pieces called chemical words, analogous to the words that make up sentences in human languages, and then apply advanced natural language processing techniques to tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, the chemical words, remain unexplored. To address this gap, we employ data-driven SMILES tokenization techniques, namely Byte Pair Encoding, WordPiece, and Unigram, to identify chemical words and compare the resulting vocabularies. To understand the chemical significance of these words, we build a language-inspired pipeline that treats the high-affinity ligands of each protein target as a document and uses tf-idf weighting to select the key chemical words that make up those ligands. Experiments on multiple protein-ligand affinity datasets show that, although the vocabularies generated by the different subword tokenization algorithms differ in their words, word lengths, and validity, the key chemical words they identify are similar. Further, we conduct case studies on several targets to analyze the impact of key chemical words on binding. We find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our approach elucidates the chemical properties of the words identified by machine learning models and can be used in drug discovery studies to determine significant chemical moieties.
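The tf-idf selection step described above can be illustrated with a minimal sketch. It assumes a plain tf-idf variant (term frequency normalized by document length, idf = log(N / df)); the exact weighting used in the study is not stated in the abstract. The target names and chemical words below are hand-written toy values, not output of Byte Pair Encoding, WordPiece, or Unigram, and the function name `key_chemical_words` is hypothetical.

```python
import math
from collections import Counter


def key_chemical_words(docs, top_k=3):
    """Rank chemical words per target by tf-idf.

    docs maps a target name to the list of chemical words obtained from
    that target's high-affinity ligands (one "document" per target).
    """
    n_docs = len(docs)

    # Document frequency: in how many targets' documents each word occurs.
    df = Counter()
    for words in docs.values():
        df.update(set(words))

    ranked = {}
    for target, words in docs.items():
        tf = Counter(words)
        total = len(words)
        # Assumed weighting: (count / document length) * log(N / df).
        scores = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        ranked[target] = ordered[:top_k]
    return ranked


# Toy, hand-written documents (illustrative only, not real affinity data).
toy_docs = {
    "target_A": ["c1ccccc1", "S(=O)(=O)N", "CC", "S(=O)(=O)N"],
    "target_B": ["c1ccccc1", "C(=O)O", "CC"],
    "target_C": ["c1ccccc1", "C(=O)N", "C(=O)N", "CC"],
}

top_words = key_chemical_words(toy_docs, top_k=1)
for target, words in top_words.items():
    print(target, words)
```

In this toy setup, words shared by every target (the benzene ring and `CC`) receive an idf of zero, so only target-specific words can rank first, which mirrors the idea that key chemical words are specific to protein targets.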


Source: http://dx.doi.org/10.1002/minf.202300249


