PASTASpark: multiple sequence alignment meets Big Data.

Bioinformatics

CiTIUS, Universidade de Santiago de Compostela, 15782 Santiago de Compostela, Spain.

Published: September 2017

Motivation: Multiple sequence alignment is a basic step in many bioinformatics analyses. One of the state-of-the-art tools for this task is PASTA (Practical Alignments using SATé and TrAnsitivity). PASTA supports multithreading, but it is limited to processing datasets on shared-memory systems. In this work we introduce PASTASpark, a tool that uses the Big Data engine Apache Spark to boost the performance of the alignment phase of PASTA, which is the most time-consuming task.

Results: Speedups of up to 10× with respect to single-threaded PASTA were observed, allowing an ultra-large dataset of 200 000 sequences to be processed within the 24-h limit.

Availability and Implementation: PASTASpark is an open-source tool available at https://github.com/citiususc/pastaspark.

Contact: josemanuel.abuin@usc.es.

Supplementary Information: Supplementary data are available at Bioinformatics online.
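The idea described under Motivation, distributing the alignment phase across workers, can be illustrated with a minimal sketch. It assumes the alignment phase consists of independent subset alignments that can be mapped in parallel. This is not PASTASpark's code: `toy_align` is a hypothetical stand-in for a real aligner (it only pads sequences with gaps), `align_subsets` is an illustrative name, and a local thread pool stands in for a cluster of Spark executors.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_align(subset):
    """Hypothetical stand-in for a real aligner.

    It only right-pads every sequence with gap characters so that all
    sequences in the subset have the same length.
    """
    width = max(len(seq) for seq in subset)
    return [seq.ljust(width, "-") for seq in subset]


def align_subsets(subsets, workers=4):
    """Align each subset independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each subset is an independent job, so a plain map is enough.
        return list(pool.map(toy_align, subsets))


# Illustrative input: two small subsets of unaligned sequences.
subsets = [["ACGT", "AC"], ["GGA", "GGATT", "G"]]
aligned = align_subsets(subsets)
print(aligned)
```

Because the jobs share no state, the same map pattern could in principle be spread over many machines rather than the threads of a single shared-memory node, which is the limitation the abstract attributes to PASTA.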

Source: http://dx.doi.org/10.1093/bioinformatics/btx354
