Self-supervised learning plays an important role in molecular representation learning because labeled molecular data are usually scarce for many tasks, such as chemical property prediction and virtual screening. However, most existing molecular pre-training methods focus on a single modality of molecular data, leaving the complementary information between two important modalities, SMILES and graph, largely unexplored. In this study, we propose an effective multi-modality self-supervised learning framework for molecular SMILES and graph. Specifically, SMILES data and graph data are first tokenized so that they can be processed by a unified Transformer-based backbone network, which is trained with a masked reconstruction strategy. In addition, we introduce a specialized non-overlapping masking strategy to encourage fine-grained interaction between the two modalities. Experimental results show that our framework achieves state-of-the-art performance on a series of molecular property prediction tasks, and a detailed ablation study demonstrates the efficacy of both the multi-modality framework and the masking strategy.
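The abstract does not describe implementation details, so the following is only a minimal sketch of what the non-overlapping masking step could look like, assuming per-atom alignment between SMILES tokens and graph node tokens; the function and variable names are illustrative, not from the paper.

```python
# Hypothetical sketch of non-overlapping masking for paired SMILES/graph
# inputs: each selected atom is masked in exactly one modality, so the
# model must consult the other modality to reconstruct it.
import random

MASK = "[MASK]"

def non_overlapping_mask(smiles_tokens, graph_tokens, mask_ratio=0.3):
    """smiles_tokens and graph_tokens are assumed to be atom-aligned lists."""
    n = len(smiles_tokens)
    assert len(graph_tokens) == n, "modalities must be atom-aligned"
    idx = random.sample(range(n), k=int(mask_ratio * n))
    half = len(idx) // 2
    smiles_masked, graph_masked = list(smiles_tokens), list(graph_tokens)
    for i in idx[:half]:      # first half of the picks: mask SMILES side only
        smiles_masked[i] = MASK
    for i in idx[half:]:      # second half: mask graph side only
        graph_masked[i] = MASK
    return smiles_masked, graph_masked

# Example with ethanol (CCO): every masked position stays visible in the
# other modality, which is what drives cross-modal interaction.
s, g = non_overlapping_mask(["C", "C", "O"], ["C", "C", "O"], mask_ratio=0.67)
```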
Full text:
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11129775
DOI: http://dx.doi.org/10.1093/bib/bbae256
Nat Commun
January 2025
University of Pittsburgh, Department of Computer Science, Pittsburgh, PA, 15260, USA.
Reliable molecular property prediction is essential for various scientific endeavors and industrial applications, such as drug discovery. However, data scarcity, the highly non-linear causal relationships between physicochemical and biological properties, and the limitations of conventional molecular featurization schemes together complicate the development of robust molecular machine learning models. Self-supervised learning (SSL) has emerged as a popular solution, utilizing large-scale, unannotated molecular data to learn a foundational representation of chemical space that may be advantageous for downstream tasks.
J Chem Inf Model
January 2025
Research Unit Structural Chemistry and Computational Biophysics, Leibniz-Forschungsinstitut für Molekulare Pharmakologie, Berlin 13125, Germany.
Morphological profiling has recently demonstrated remarkable potential for identifying the biological activities of small molecules. Alongside the fully supervised and self-supervised machine learning methods recently proposed for bioactivity prediction from Cell Painting image data, we introduce here a semi-supervised contrastive (SemiSupCon) learning approach. This approach combines the strengths of using biological annotations in supervised contrastive learning with those of leveraging large unannotated image data sets in self-supervised contrastive learning.
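To make the idea concrete (this is an illustration, not the authors' implementation), a semi-supervised contrastive loss can treat same-label samples as extra positives when annotations exist and fall back to the purely self-supervised positive pair otherwise; the PyTorch sketch below assumes two augmented views per sample and uses -1 to mark unlabeled samples.

```python
# Sketch of a semi-supervised contrastive loss: labeled samples use label
# agreement to define positives (supervised contrastive), unlabeled samples
# keep only their augmented view as positive (SimCLR-style).
import torch
import torch.nn.functional as F

def semi_sup_con_loss(z1, z2, labels, tau=0.1):
    """z1, z2: (N, d) embeddings of two views; labels: (N,), -1 = unlabeled."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)        # (2N, d)
    y = torch.cat([labels, labels])                    # (2N,)
    sim = z @ z.T / tau
    n = z.shape[0]
    eye = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))          # exclude self-pairs
    pos = torch.roll(eye, n // 2, dims=1)              # other view: always positive
    pos = pos | ((y[:, None] == y[None, :]) & (y[:, None] >= 0) & ~eye)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(log_prob * pos).sum(1).div(pos.sum(1)).mean()
```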
Radiol Phys Technol
January 2025
Department of Diagnostic Imaging, Tohoku University Graduate School of Medicine, 2-1 Seiryo-machi, Aoba-ku, Sendai, Miyagi, 980-8575, Japan.
Self-supervised learning (SSL) has gained attention in the medical field as a deep learning approach that utilizes unlabeled data. The Jigsaw puzzle task in SSL enables models to learn both the features of images and the positional relationships within them. In breast cancer diagnosis, radiologists evaluate not only lesion-specific features but also the surrounding breast structures.
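A minimal sketch of the Jigsaw pretext task described above (the 3x3 tiling and fixed permutation set are common choices, assumed here rather than taken from the paper):

```python
# Jigsaw pretext task: split an image into a 3x3 grid of tiles, shuffle the
# tiles with one permutation from a fixed set, and train a classifier to
# predict which permutation was applied.
import random
import numpy as np

PERMUTATIONS = [tuple(random.sample(range(9), 9)) for _ in range(100)]

def make_jigsaw_sample(image):
    """image: (H, W) array with H and W divisible by 3."""
    h, w = image.shape[0] // 3, image.shape[1] // 3
    tiles = [image[r*h:(r+1)*h, c*w:(c+1)*w]
             for r in range(3) for c in range(3)]
    label = random.randrange(len(PERMUTATIONS))
    shuffled = [tiles[i] for i in PERMUTATIONS[label]]
    return np.stack(shuffled), label   # the model learns to predict `label`
```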
Med Image Anal
December 2024
Faculty of Biomedical Engineering, Technion, Haifa, Israel.
Quantitative analysis of pseudo-diffusion in diffusion-weighted magnetic resonance imaging (DWI) data shows potential for assessing fetal lung maturation and generating valuable imaging biomarkers. Yet, the clinical utility of DWI data is hindered by unavoidable fetal motion during acquisition. We present IVIM-morph, a self-supervised deep neural network model for motion-corrected quantitative analysis of DWI data using the Intra-voxel Incoherent Motion (IVIM) model.
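For context, the standard bi-exponential IVIM signal model that such quantitative analysis estimates can be written and fitted as below; the b-values, starting values, and bounds are illustrative assumptions, and the motion-correction network itself is not sketched here.

```python
# Bi-exponential IVIM model: S(b) = S0 * (f*exp(-b*D*) + (1-f)*exp(-b*D)),
# with perfusion fraction f, pseudo-diffusion coefficient D* (d_star), and
# tissue diffusion coefficient D.
import numpy as np
from scipy.optimize import curve_fit

def ivim(b, s0, f, d_star, d):
    return s0 * (f * np.exp(-b * d_star) + (1 - f) * np.exp(-b * d))

b_vals = np.array([0, 50, 100, 200, 400, 600, 800], dtype=float)
signal = ivim(b_vals, 1.0, 0.1, 0.05, 0.0015)   # synthetic example signal
params, _ = curve_fit(ivim, b_vals, signal,
                      p0=[1.0, 0.1, 0.01, 0.001],
                      bounds=([0, 0, 0, 0], [2, 1, 1, 0.01]))
```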
BioData Min
January 2025
School of Computer Science, Fudan University, Shanghai, China.
This survey explores the transformative impact of foundation models (FMs) in artificial intelligence, focusing on their integration with federated learning (FL) in biomedical research. Foundation models such as ChatGPT, LLaMA, and CLIP, which are trained on vast datasets through methods including unsupervised pretraining, self-supervised learning, instruction fine-tuning, and reinforcement learning from human feedback, represent significant advancements in machine learning. These models, with their ability to generate coherent text and realistic images, are crucial for biomedical applications that require processing diverse data forms such as clinical reports, diagnostic images, and multimodal patient interactions.
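As a concrete anchor for the FL side of this integration (a textbook sketch, not a system from the survey), federated averaging (FedAvg) aggregates locally trained model parameters without sharing the underlying biomedical data:

```python
# FedAvg: the server averages per-client parameters, weighted by each
# client's local dataset size; raw data never leaves the client.
from typing import Dict, List
import numpy as np

def fed_avg(client_weights: List[Dict[str, np.ndarray]],
            client_sizes: List[int]) -> Dict[str, np.ndarray]:
    total = sum(client_sizes)
    return {k: sum(w[k] * (n / total)
                   for w, n in zip(client_weights, client_sizes))
            for k in client_weights[0]}
```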