SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop.

Bioinformatics

Aalto University School of Science and Helsinki Institute for Information Technology HIIT, Finland, International Computer Science Institute, Berkeley, CA, USA, CRS4-Center for Advanced Studies, Research and Development in Sardinia, Italy and CSC-IT Center for Science, Finland.

Published: January 2014

Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts.

Availability and Implementation: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3866557
DOI: http://dx.doi.org/10.1093/bioinformatics/btt601
