ASTRAL-MP: scaling ASTRAL to very large datasets using randomization and parallelization.

Bioinformatics

Department of Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA, USA.

Published: October 2019

Motivation: Evolutionary histories can change from one part of the genome to another. The potential for discordance among gene trees has motivated the development of summary methods that reconstruct a species tree from an input collection of gene trees. ASTRAL is a widely used summary method and has been able to scale to relatively large datasets. However, the size of genomic datasets is growing quickly. Despite its relative efficiency, the current single-threaded implementation of ASTRAL is falling behind data growth trends and cannot analyze the largest available datasets in a reasonable time.

Results: ASTRAL uses dynamic programming and is not trivially parallel. In this paper, we introduce ASTRAL-MP, the first version of ASTRAL that can exploit parallelism; it also uses randomization techniques to speed up some of its steps. Importantly, ASTRAL-MP can take advantage of not just multiple CPU cores but also one or several graphics processing units (GPUs). The ASTRAL-MP code scales very well with increasing CPU cores, and its GPU version, implemented in OpenCL, can achieve speedups of up to 158× compared to ASTRAL-III. Using GPUs and multiple cores, ASTRAL-MP is able to analyze datasets with 10 000 species or datasets with more than 100 000 genes in <2 days.
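
To illustrate why a dynamic program is "not trivially parallel" yet still offers parallelism, here is a minimal Python sketch. It is not the ASTRAL-MP implementation. It assumes a toy recurrence in which each candidate cluster is scored by its best split into two smaller candidate clusters, so clusters of the same size do not depend on each other and can be scored concurrently. The names `best_split`, `parallel_dp`, and the `bonus` weight table are hypothetical, and Python threads only show the scheduling structure, not real CPU or GPU speedups.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def best_split(cluster, clusters, value, weight):
    """Score one cluster by its best split into two already-scored sub-clusters."""
    best = None
    for a in clusters:
        b = cluster - a
        # a must be a non-empty proper subset, and both halves must be scored already
        if a and a < cluster and a in value and b in value:
            score = value[a] + value[b] + weight(a, b)
            if best is None or score > best:
                best = score
    return cluster, best


def parallel_dp(clusters, weight, workers=4):
    """Toy DP over clusters: sizes go in order, clusters of equal size run concurrently."""
    value = {c: 0 for c in clusters if len(c) == 1}  # base case: single taxa
    by_size = sorted((c for c in clusters if len(c) > 1), key=len)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _, level in groupby(by_size, key=len):
            # Each level only reads scores of strictly smaller clusters,
            # so its members are independent of one another.
            results = list(pool.map(
                lambda c: best_split(c, clusters, value, weight), level))
            for cluster, score in results:
                if score is not None:  # clusters with no valid split stay unscored
                    value[cluster] = score
    return value


# Hypothetical toy input: four taxa and a handful of candidate clusters.
fs = frozenset
clusters = {fs("a"), fs("b"), fs("c"), fs("d"),
            fs("ab"), fs("cd"), fs("ac"), fs("abc"), fs("abcd")}
bonus = {fs("ab"): 3, fs("cd"): 2, fs("ac"): 4}  # made-up weights


def weight(a, b):
    return bonus.get(a, 0) + bonus.get(b, 0)


value = parallel_dp(clusters, weight)
print(value[fs("abcd")])  # prints 5
```

The dependency between size levels is what prevents a naive parallel loop over all subproblems; the independence within a level is what a multi-core or GPU implementation could exploit under these assumptions.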

Availability and Implementation: ASTRAL-MP is available at https://github.com/smirarab/ASTRAL/tree/MP.

Supplementary Information: Supplementary data are available at Bioinformatics online.

DOI: http://dx.doi.org/10.1093/bioinformatics/btz211
