Background: Among the various molecular fingerprints available to describe small organic molecules, the extended connectivity fingerprint, up to four bonds (ECFP4), performs best in benchmarking drug analog recovery studies because it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high-dimensional representations (≥ 1024D) to perform well, so ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC perform very slowly due to the curse of dimensionality.
Results: Herein we report a new fingerprint, called MinHash fingerprint, up to six bonds (MHFP6), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. With locality sensitive hashing, approximate nearest neighbor search methods perform as well on unfolded MHFP6 as comparable methods do on folded ECFP4 fingerprints in terms of speed and relative recovery rate, while operating in a very sparse and high-dimensional binary chemical space.
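The MinHash step described above can be illustrated with a minimal sketch. It is a generic MinHash over a set of strings, not the authors' implementation: the fragment SMILES in `mol_a` and `mol_b` are hand-written, hypothetical stand-ins for the circular substructures that a cheminformatics toolkit would extract, and the names `make_hash_family`, `hash32`, `minhash` and `estimate_jaccard`, the SHA-1 based string hash, and the modulus are all assumptions made for illustration.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as modulus (an assumption of this sketch)


def make_hash_family(n_perm, seed=42):
    """Draw n_perm random (a, b) pairs defining h -> (a * h + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n_perm)]


def hash32(text):
    """Map a string to a 32-bit integer using the first 4 bytes of its SHA-1 digest."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:4], "little")


def minhash(shingles, family):
    """Return the MinHash signature (one minimum per hash function) of a non-empty set."""
    hashed = [hash32(s) for s in shingles]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in family]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical substructure SMILES sets for two molecules (illustrative only).
mol_a = {"c1ccccc1", "CC", "CCO", "C(C)O", "cc(c)C", "CO"}
mol_b = {"c1ccccc1", "CC", "CCO", "C(C)O", "CCN", "CN"}

family = make_hash_family(1024)
sig_a = minhash(mol_a, family)
sig_b = minhash(mol_b, family)

# The two sets share 4 of 8 distinct fragments, so the exact Jaccard similarity is 0.5;
# the MinHash estimate should be close to that value.
print(estimate_jaccard(sig_a, sig_b))
```

The signature length (1024 here) trades memory for accuracy: more hash functions give a tighter estimate of the Jaccard similarity between the underlying substructure sets.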
Conclusion: MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub (https://github.com/reymond-group/mhfp).
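To show why MinHash signatures allow locality sensitive hashing to be applied directly, here is a minimal sketch of the textbook banding scheme. It is a generic illustration and not necessarily the LSH variant used by the authors; the class name `BandedLSH`, its parameters, and the toy integer signatures are all hypothetical.

```python
from collections import defaultdict


class BandedLSH:
    """Index MinHash-style signatures by splitting them into bands of equal size."""

    def __init__(self, n_bands, rows_per_band):
        self.n_bands = n_bands
        self.rows = rows_per_band
        # One bucket table per band: band contents -> set of stored keys.
        self.tables = [defaultdict(set) for _ in range(n_bands)]

    def _bands(self, signature):
        if len(signature) != self.n_bands * self.rows:
            raise ValueError("signature length must equal n_bands * rows_per_band")
        for i in range(self.n_bands):
            yield i, tuple(signature[i * self.rows:(i + 1) * self.rows])

    def add(self, key, signature):
        """Store a key under the bucket of each of its bands."""
        for i, band in self._bands(signature):
            self.tables[i][band].add(key)

    def query(self, signature):
        """Return keys that share at least one identical band with the query."""
        candidates = set()
        for i, band in self._bands(signature):
            candidates |= self.tables[i].get(band, set())
        return candidates


# Toy signatures of length 6, split into 3 bands of 2 rows (illustrative values only).
index = BandedLSH(n_bands=3, rows_per_band=2)
index.add("mol_1", [1, 2, 3, 4, 5, 6])
index.add("mol_2", [1, 2, 9, 9, 9, 9])
index.add("mol_3", [7, 7, 7, 7, 7, 7])

# The query shares band 0 with mol_1 and mol_2, and band 2 with mol_1.
print(sorted(index.query([1, 2, 0, 0, 5, 6])))  # prints ['mol_1', 'mol_2']
```

Only the returned candidates need an exact similarity comparison, which is what makes the search approximate but fast; more rows per band make matches stricter, while more bands make them more permissive.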
Source | Link |
---|---|
PMC | http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6755601 |
DOI | http://dx.doi.org/10.1186/s13321-018-0321-8 |