A probabilistic molecular fingerprint for big data settings.

J Cheminform

Department of Chemistry and Biochemistry, National Center for Competence in Research NCCR TransCure, University of Berne, Freiestrasse 3, 3012, Bern, Switzerland.

Published: December 2018

Background: Among the various molecular fingerprints available to describe small organic molecules, the extended connectivity fingerprint, up to four bonds (ECFP4), performs best in benchmarking drug analog recovery studies because it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high-dimensional representations (≥ 1024D) to perform well, so ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC are very slow due to the curse of dimensionality.

Results: Herein we report a new fingerprint, called MinHash fingerprint, up to six bonds (MHFP6), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. LSH approximate nearest neighbor search methods perform as well on unfolded MHFP6 as comparable methods do on folded ECFP4 fingerprints in terms of speed and relative recovery rate, while operating in a very sparse and high-dimensional binary chemical space.
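The MinHash step described above can be illustrated with a minimal sketch. This is a generic MinHash over a set of strings, not the authors' implementation: the hash family, the 32-bit truncation, the function names, and the toy fragment strings are all assumptions made for illustration, and the substructure extraction itself (which would need a cheminformatics toolkit) is left out.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the hash family (assumed choice)
MAX_HASH = (1 << 32) - 1        # keep 32 bits per signature entry (assumed choice)


def base_hash(token: str) -> int:
    """Map a string to a 32-bit integer via SHA-1, deterministic across runs."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def make_permutations(n_perm: int, seed: int = 42):
    """Draw n_perm random (a, b) pairs defining h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(n_perm)
    ]


def minhash(tokens, perms):
    """Return the MinHash signature: one minimum per hash function."""
    hashes = [base_hash(t) for t in set(tokens)]
    if not hashes:
        raise ValueError("cannot MinHash an empty set")
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
        for a, b in perms
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature entries, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    # Hypothetical fragment strings standing in for extracted substructure SMILES.
    mol_a = {"CC", "CCO", "CO", "C(C)O"}
    mol_b = {"CC", "CCN", "CN", "C(C)N"}
    perms = make_permutations(512)
    print(estimated_jaccard(minhash(mol_a, perms), minhash(mol_b, perms)))
```

The fraction of matching signature entries approximates the Jaccard similarity of the two underlying sets, so the sets never have to be folded into a fixed-length bit vector before comparison.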

Conclusion: MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (https://github.com/reymond-group/mhfp).
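To show why MinHash signatures lend themselves to locality sensitive hashing, here is a minimal sketch of the classic banding scheme. It is not necessarily the LSH variant used in the paper; the class name, the parameters, and the hand-written integer signatures are hypothetical and chosen only to keep the example small.

```python
from collections import defaultdict


class BandedLSHIndex:
    """Toy LSH index: a signature is cut into bands, each band is a bucket key."""

    def __init__(self, n_bands: int, rows_per_band: int):
        self.n_bands = n_bands
        self.rows = rows_per_band
        # One hash table per band, mapping band contents to the keys stored there.
        self.tables = [defaultdict(set) for _ in range(n_bands)]

    def _bands(self, signature):
        if len(signature) != self.n_bands * self.rows:
            raise ValueError("signature length must equal n_bands * rows_per_band")
        for i in range(self.n_bands):
            yield i, tuple(signature[i * self.rows:(i + 1) * self.rows])

    def add(self, key, signature):
        """Store key in the bucket of every band of its signature."""
        for i, band in self._bands(signature):
            self.tables[i][band].add(key)

    def query(self, signature):
        """Return all keys sharing at least one identical band with the query."""
        candidates = set()
        for i, band in self._bands(signature):
            candidates |= self.tables[i].get(band, set())
        return candidates


if __name__ == "__main__":
    index = BandedLSHIndex(n_bands=4, rows_per_band=2)
    # Hand-written signatures; real ones would come from a MinHash step.
    index.add("mol_a", [1, 2, 3, 4, 5, 6, 7, 8])
    index.add("mol_b", [1, 2, 3, 4, 9, 9, 9, 9])      # shares two bands with mol_a
    index.add("mol_c", [10, 11, 12, 13, 14, 15, 16, 17])
    print(sorted(index.query([1, 2, 3, 4, 5, 6, 7, 8])))
```

A query only inspects the buckets its own bands fall into, so the candidate set is retrieved without comparing against every stored signature. Candidates would then typically be re-ranked by an exact or estimated similarity.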

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6755601
DOI: http://dx.doi.org/10.1186/s13321-018-0321-8
