Analyzing large scale genomic data on the cloud with Sparkhit.

Bioinformatics

Faculty of Technology, Bielefeld University, Bielefeld 33615, Germany.

Published: May 2018

Motivation: The growing volume of next-generation sequencing data poses a fundamental challenge for large-scale genomic analytics. Existing tools use different distributed computational platforms to scale out bioinformatics workloads. However, these tools do not scale efficiently, and they incur heavy run-time overheads when pre-processing large amounts of data. To address these limitations, we have developed Sparkhit: a distributed bioinformatics framework built on top of the Apache Spark platform.

Results: Sparkhit integrates a variety of analytical methods and is implemented in the Spark extended MapReduce model. It runs 92-157 times faster than MetaSpark on metagenomic fragment recruitment and 18-32 times faster than Crossbow on data pre-processing. We analyzed 100 terabytes of data across four genomic projects in the cloud in 21 h, including the time for cluster deployment and data downloading. Furthermore, applying Sparkhit to the entire Human Microbiome Project shotgun sequencing dataset took 2 h, demonstrating an approach for easily associating large public datasets with reference data.
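
The MapReduce model named in the Results can be illustrated with a small sketch. The code below is not Sparkhit code and does not use Spark; it is a plain-Python illustration of the map and reduce steps, applied to k-mer counting as an assumed example task. The function names (`map_read`, `reduce_by_key`, `count_kmers`) and the toy reads are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def reduce_by_key(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def count_kmers(reads, k):
    # Each read is mapped independently of the others, which is the
    # property a distributed platform exploits to spread work over nodes.
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    return reduce_by_key(mapped)


# Toy reads invented for illustration only.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
```

In a distributed setting the map step would run on many workers in parallel and the reduce step would combine their partial results; here both run in a single process to keep the sketch self-contained.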

Availability And Implementation: Sparkhit is freely available at: https://rhinempi.github.io/sparkhit/.

Contact: asczyrba@cebitec.uni-bielefeld.de.

Supplementary Information: Supplementary data are available at Bioinformatics online.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5925781
DOI: http://dx.doi.org/10.1093/bioinformatics/btx808

Similar Publications

Pore-Controllable Synthesis of Phthalic Acid-Derived Hierarchical Activated Carbon for Dilute CO2 Capture.

Inorg Chem

December 2024

Textile Pollution Controlling Engineering Center of Ministry of Ecology and Environment, College of Environmental Science and Engineering, Donghua University, Shanghai 201620, China.

Carbon capture and storage (CCS) from dilute sources is an important strategy for stabilizing the concentration of atmospheric carbon dioxide and global temperature. However, the adsorption process is extremely challenging due to the sluggish diffusion rate of dilute CO2. Herein, p-phthalic acid (PTA)-derived hierarchical porous activated carbon (PTA-C) with abundant micro- and mesopores was successfully prepared for dilute CO2 (2 vol %) capture at ambient conditions.

Mass spectrometry (MS)-based metabolomics often relies on separation techniques when analyzing complex biological specimens to improve method resolution, metabolome coverage, quantitative performance, and/or unknown identification. However, low sample throughput and complicated data preprocessing procedures remain major barriers to affordable metabolomic studies that are scalable to large populations. Herein, we introduce PeakMeister as a new software tool in the R statistical environment to enable standardized processing of serum metabolomic data acquired by multisegment injection-capillary electrophoresis-mass spectrometry (MSI-CE-MS), a high-throughput separation platform (<4 min/sample) that takes advantage of a serial injection format of 13 samples within a single analytical run.

Is increased mutation driving genetic diversity in dogs within the Chornobyl exclusion zone?

PLoS One

December 2024

Department of Molecular Biomedical Sciences, College of Veterinary Medicine, North Carolina State University, Raleigh, NC, United States of America.

Environmental contamination can have lasting impacts on surrounding communities, though the long-term impacts can be difficult to ascertain. The disaster at the Chornobyl Nuclear Power Plant in 1986 and subsequent remediation efforts resulted in contamination of the local environment with radioactive material, heavy metals, and additional environmental toxicants. Many of these are mutagenic in nature, and the full effect of these exposures on local flora and fauna has yet to be understood.

Clarifying the pore-throat size and pore size distribution of tight sandstone reservoirs and quantitatively characterizing the heterogeneity of pore-throat structures are crucial for evaluating reservoir effectiveness and predicting productivity. Through a series of rock physics experiments, including gas measurement of porosity and permeability, casting thin sections, scanning electron microscopy, and high-pressure mercury injection, the quality of reservoir properties and microscopic pore-throat structure characteristics were systematically studied. Combined with fractal geometry theory, the effects of different pore-throat types, geometric shapes, and scale sizes on the fractal characteristics and heterogeneity of sandstone pore-throat structure are clarified.

Hierarchical Porous Microspheres-Assisted Serum Metabolic Profile for the Early Diagnosis and Surveillance of Postmenopausal Osteoporosis.

Anal Chem

December 2024

Department of Chemistry, Institutes of Biomedical Sciences, Zhongshan Hospital, Fudan University, Shanghai 200433, China.

With the aging global population, the incidence of osteoporosis (OP) is increasing, putting more individuals at risk. Because postmenopausal osteoporosis (PMOP) often remains asymptomatic until a fracture occurs, its early clinical diagnosis is particularly challenging. In this work, AuNPs-anchored hierarchical porous ZrO2 microspheres (Au/HPZOMs) are designed to assist laser desorption/ionization mass spectrometry (LDI-MS) in acquiring serum metabolic fingerprints of PMOP, postmenopausal osteopenia (PMON), and healthy controls (HC), enabling the early diagnosis and surveillance of PMOP.
