The rise of machine learning (ML) has created an explosion in the potential strategies for using data to make scientific predictions. For physical scientists wishing to apply ML strategies to a particular domain, it can be difficult to assess in advance what strategy to adopt within a vast space of possibilities. Here we outline the results of an online community-powered effort to swarm search the space of ML strategies and develop algorithms for predicting atomic-pairwise nuclear magnetic resonance (NMR) properties in molecules. Using an open-source dataset, we worked with Kaggle to design and host a 3-month competition which received 47,800 ML model predictions from 2,700 teams in 84 countries. Within 3 weeks, the Kaggle community produced models with comparable accuracy to our best previously published 'in-house' efforts. A meta-ensemble model constructed as a linear combination of the top predictions has a prediction accuracy which exceeds that of any individual model, 7-19x better than our previous state-of-the-art. The results highlight the potential of transformer architectures for predicting quantum mechanical (QM) molecular properties.
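
The meta-ensemble idea mentioned above, a linear combination of several models' predictions, can be illustrated with a minimal sketch. This is not the authors' procedure: the data are synthetic, the three "models" are hypothetical noisy copies of the target, the function names are invented for illustration, and the weights are fitted by ordinary least squares on the same samples used for evaluation (in practice they would be fitted on held-out data).

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_linear_ensemble(predictions, targets):
    """Least-squares weights, one per model, via the normal equations.

    predictions: list of per-model prediction lists (hypothetical layout).
    """
    m = len(predictions)
    gram = [[sum(p * q for p, q in zip(predictions[i], predictions[j]))
             for j in range(m)] for i in range(m)]
    rhs = [sum(p * t for p, t in zip(predictions[i], targets))
           for i in range(m)]
    return solve_linear_system(gram, rhs)


def blend(predictions, weights):
    """Weighted sum of the model predictions for each sample."""
    n_samples = len(predictions[0])
    return [sum(w * p[k] for w, p in zip(weights, predictions))
            for k in range(n_samples)]


def mse(pred, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, targets)) / len(targets)


# Synthetic stand-in data: three models with independent noise (assumption).
random.seed(0)
targets = [random.gauss(0.0, 1.0) for _ in range(500)]
predictions = [[t + random.gauss(0.0, scale) for t in targets]
               for scale in (0.3, 0.4, 0.5)]

weights = fit_linear_ensemble(predictions, targets)
individual_mse = [mse(p, targets) for p in predictions]
ensemble_mse = mse(blend(predictions, weights), targets)
```

On the fitting data, the least-squares blend can never have a higher squared error than the best single model, because using one model alone is just a special case of the weight vector. Whether that advantage carries over to unseen data depends on how different the models' errors are.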

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC8291653
PLOS: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0253612

Publication Analysis

Top Keywords

machine learning: 8
community-powered search: 4
search machine: 4
learning strategy: 4
strategy space: 4
space find: 4
find nmr: 4
nmr property: 4
property prediction: 4
prediction models: 4

Similar Publications

Purpose: To quantify outer retina structural changes and define novel biomarkers of inherited retinal degeneration associated with biallelic mutations in RPE65 (RPE65-IRD) in patients before and after subretinal gene augmentation therapy with voretigene neparvovec (Luxturna).

Methods: Application of advanced deep learning for automated retinal layer segmentation, specifically tailored for RPE65-IRD. Quantification of five novel biomarkers for the ellipsoid zone (EZ): thickness, granularity, reflectivity, and intensity.

Women are disproportionately affected by chronic autoimmune diseases (AD) like systemic lupus erythematosus (SLE), scleroderma, rheumatoid arthritis (RA), and Sjögren's syndrome. Traditional evaluations often underestimate the associated cardiovascular disease (CVD) and stroke risk in women having AD. Vitamin D deficiency increases susceptibility to these conditions.

The combination of physiology and machine learning for prediction of CPAP pressure and residual AHI in OSA.

J Clin Sleep Med

January 2025

Division of Pulmonary, Critical Care, and Sleep Medicine, UC San Diego, San Diego, CA.

Continuous positive airway pressure (CPAP) is the treatment of choice for obstructive sleep apnea (OSA); however some people have residual respiratory events or require significantly higher CPAP pressure while on therapy. Our objective was to develop predictive models for CPAP outcomes and assess whether the inclusion of physiological traits enhances prediction. We constructed predictive models from baseline information for subsequent residual apnea-hypopnea index (AHI) and optimal CPAP pressure.

Active learning of molecular data for task-specific objectives.

J Chem Phys

January 2025

Department of Applied Physics, Aalto University, P.O. Box 11000, FI-00076 Aalto, Finland.

Active learning (AL) has shown promise to be a particularly data-efficient machine learning approach. Yet, its performance depends on the application, and it is not clear when AL practitioners can expect computational savings. Here, we carry out a systematic AL performance assessment for three diverse molecular datasets and two common scientific tasks: compiling compact, informative datasets and targeted molecular searches.

Unlabelled: Thousands of complete genome sequences for strains of a species that are now available enable the advancement of pangenome analytics to a new level of sophistication. We collected 2,377 publicly available complete genomes of for detailed pangenome analysis. The core genome and accessory genomes consisted of 2,398 and 5,182 genes, respectively.
