Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models.

J Chem Inf Model

Department of Chemistry and Applied Biosciences, ETH Zurich, RETHINK, Vladimir-Prelog-Weg 4, Zurich 8093, Switzerland.

Published: March 2022

Chemical language models (CLMs) can be employed to design molecules with desired properties. CLMs generate new chemical structures in the form of textual representations, such as the simplified molecular input line entry system (SMILES) strings. However, the quality of these de novo generated molecules is difficult to assess a priori. In this study, we apply the perplexity metric to determine the degree to which the molecules generated by a CLM match the desired design objectives. This model-intrinsic score allows identifying and ranking the most promising molecular designs based on the probabilities learned by the CLM. Using perplexity to compare "greedy" (beam search) with "explorative" (multinomial sampling) methods for SMILES generation, certain advantages of multinomial sampling become apparent. Additionally, perplexity scoring is performed to identify undesired model biases introduced during model training and allows the development of a new ranking system to remove those undesired biases.
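As a rough illustration of the scoring idea described above, the following minimal sketch computes perplexity from per-token probabilities and ranks candidate SMILES strings by it. It assumes the common definition of perplexity as the exponential of the negative mean log-probability per token; the abstract does not state the exact formula the authors use. The function names `perplexity` and `rank_by_perplexity` and all probability values are hypothetical and do not come from a trained CLM.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity of one sequence, given the probability the model
    assigned to each of its tokens.

    Assumed definition: exp(-(1/N) * sum(log p_i)). Lower values mean
    the model considers the sequence more likely.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)


def rank_by_perplexity(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return the SMILES strings sorted from lowest to highest perplexity."""
    return sorted(candidates, key=lambda smiles: perplexity(candidates[smiles]))


# Hypothetical per-token probabilities, invented for illustration only.
toy_candidates = {
    "CCN": [0.5, 0.5, 0.5],
    "C1CC1": [0.2, 0.3, 0.25, 0.2, 0.3],
    "CCO": [0.9, 0.8, 0.9],
}

ranking = rank_by_perplexity(toy_candidates)
for smiles in ranking:
    print(f"{smiles}\t{perplexity(toy_candidates[smiles]):.3f}")
```

Because the score is normalized by sequence length, it lets short and long SMILES strings be compared on the same scale, which a raw product of token probabilities would not.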

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8924923
DOI: http://dx.doi.org/10.1021/acs.jcim.2c00079

Similar Publications

Background: Diabetes mellitus and obesity are two of the most frequent health conditions in the world, prompting medical researchers to seek novel effective treatments. According to World Health Organization (WHO) regulations and several research studies, diabetes is regarded as a significant and leading health concern worldwide. The search for efficient and safe antidiabetic drugs has led to the study of pyridine derivatives, a family of molecules with a wide range of pharmacological characteristics.

Myoclonus is already associated with a wide variety of drugs and systemic conditions. As new components are discovered, more drugs are suspected of causing this disabling abnormal involuntary movement. This systematic review aims to assess the medications associated with drug-induced myoclonus (DIM).

AI-Powered Neurogenetics: Supporting Patient's Evaluation with Chatbot.

Genes (Basel)

December 2024

Genomic Medicine Laboratory UILDM, IRCCS Santa Lucia Foundation, 00179 Rome, Italy.

Background/objectives: Artificial intelligence and large language models like ChatGPT and Google's Gemini are promising tools with remarkable potential to assist healthcare professionals. This study explores ChatGPT and Gemini's potential utility in assisting clinicians during the first evaluation of patients with suspected neurogenetic disorders.

Methods: By analyzing the model's performance in identifying relevant clinical features, suggesting differential diagnoses, and providing insights into possible genetic testing, this research seeks to determine whether these AI tools could serve as a valuable adjunct in neurogenetic assessments.

Background: MRI offers quantification of proton density fat fraction (PDFF) and tissue characteristics with T1 mapping. The influence of age, sex, and the potential confounding effects of fat on T1 values in skeletal muscle in healthy adults are insufficiently known.

Purpose: To determine the accuracy and repeatability of a saturation-recovery chemical-shift encoded multiparametric approach (SR-CSE) for quantification of T1 and muscle fat content, and establish normative values (age, sex) from a healthy cohort.

Automating alloy design and discovery with physics-aware multimodal multiagent AI.

Proc Natl Acad Sci U S A

January 2025

Laboratory for Atomistic and Molecular Mechanics, Massachusetts Institute of Technology, Cambridge, MA 02139.

The design of new alloys is a multiscale problem that requires a holistic approach that involves retrieving relevant knowledge, applying advanced computational methods, conducting experimental validations, and analyzing the results, a process that is typically slow and reserved for human experts. Machine learning can help accelerate this process, for instance, through the use of deep surrogate models that connect structural and chemical features to material properties, or vice versa. However, existing data-driven models often target specific material objectives, offering limited flexibility to integrate out-of-domain knowledge and cannot adapt to new, unforeseen challenges.
