Biospark: scalable analysis of large numerical datasets from biological simulations and experiments using Hadoop and Spark.

Max Klein, Rati Sharma, Chris H Bohrer, Cameron M Avelis, Elijah Roberts

Bioinformatics

Department of Biophysics, Johns Hopkins University, Baltimore, MD 21218, USA.

Published: January 2017

Summary: Data-parallel programming techniques can dramatically decrease the time needed to analyze large datasets. While these methods have provided significant improvements for sequencing-based analyses, other areas of biological informatics have not yet adopted them. Here, we introduce Biospark, a new framework for performing data-parallel analysis on large numerical datasets. Biospark builds upon the open-source Hadoop and Spark projects, bringing domain-specific features for biology.
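The data-parallel pattern the abstract refers to can be illustrated with a minimal sketch. It does not use Biospark, Hadoop, or Spark, and it is not the authors' implementation: local threads stand in for cluster workers, and every name in it (`partition`, `summarize`, `combine`, `parallel_mean_variance`) is hypothetical. It only shows the general idea of splitting a numerical dataset into partitions, summarizing each partition independently, and merging the partial results with an associative operation.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(values, n_parts):
    """Split a list into at most n_parts contiguous chunks (the partitions)."""
    size = -(-len(values) // n_parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def summarize(chunk):
    """Map step: reduce one partition to (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Reduce step: merge two partial summaries (associative, commutative)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def parallel_mean_variance(values, n_parts=4):
    """Mean and population variance computed from per-partition summaries."""
    if not values:
        raise ValueError("values must be non-empty")
    chunks = partition(values, n_parts)
    # Threads are an illustrative stand-in for distributed workers.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(summarize, chunks))
    n, total, total_sq = reduce(combine, partials)
    mean = total / n
    return mean, total_sq / n - mean * mean
```

Because `combine` is associative and commutative, the partial summaries can be merged in any order, which is the property that lets this kind of computation be spread across many machines.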

Availability and Implementation: Source code is licensed under the Apache 2.0 open source license and is available at the project website: https://www.assembla.com/spaces/roberts-lab-public/wiki/Biospark

Contact: eroberts@jhu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6276899
DOI: http://dx.doi.org/10.1093/bioinformatics/btw614
