Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models.

Michael Moret, Francesca Grisoni, Paul Katzberger, Gisbert Schneider

J Chem Inf Model

Department of Chemistry and Applied Biosciences, ETH Zurich, RETHINK, Vladimir-Prelog-Weg 4, Zurich 8093, Switzerland.

Published: March 2022

Chemical language models (CLMs) can be employed to design molecules with desired properties. CLMs generate new chemical structures in the form of textual representations, such as the simplified molecular input line entry system (SMILES) strings. However, the quality of these de novo generated molecules is difficult to assess a priori. In this study, we apply the perplexity metric to determine the degree to which the molecules generated by a CLM match the desired design objectives. This model-intrinsic score allows identifying and ranking the most promising molecular designs based on the probabilities learned by the CLM. Using perplexity to compare "greedy" (beam search) with "explorative" (multinomial sampling) methods for SMILES generation, certain advantages of multinomial sampling become apparent. Additionally, perplexity scoring is performed to identify undesired model biases introduced during model training and allows the development of a new ranking system to remove those undesired biases.
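To make the scoring idea concrete, the following is a minimal sketch of perplexity-based ranking. It assumes the standard definition of perplexity (the exponential of the negative mean log-probability per token); the abstract does not state the exact formula used in the study. The function names, the character-level tokenization, and the per-token probabilities are hypothetical and serve only as illustration. In practice the probabilities would come from the trained CLM.

```python
import math


def perplexity(token_probs):
    """Perplexity of one sequence, given the probability the model
    assigned to each of its tokens (standard definition, assumed here)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


def rank_by_perplexity(candidates):
    """Sort (smiles, token_probs) pairs from lowest to highest perplexity.
    Lower perplexity means the model finds the string more likely."""
    scored = [(smiles, perplexity(probs)) for smiles, probs in candidates]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical per-token probabilities (one per SMILES character),
# invented for illustration and not taken from the paper.
candidates = [
    ("CCO", [0.9, 0.8, 0.9]),
    ("c1ccccc1", [0.5, 0.4, 0.6, 0.5, 0.5, 0.4, 0.6, 0.5]),
    ("CC(=O)O", [0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.8]),
]

ranking = rank_by_perplexity(candidates)
for smiles, ppl in ranking:
    print(f"{smiles}\t{ppl:.3f}")
```

Because the log-probabilities are averaged over the number of tokens, the score is normalized by string length, so longer SMILES strings are not penalized simply for containing more tokens.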

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8924923
DOI: http://dx.doi.org/10.1021/acs.jcim.2c00079
