SAMQA: error classification and validation of high-throughput sequenced read data.

Thomas Robinson, Sarah Killcoyne, Ryan Bressler, John Boyle

BMC Genomics

Institute for Systems Biology, 401 Terry Ave N, Seattle, WA 98109, USA.

Published: August 2011

Background: Advances in high-throughput sequencing technologies and the growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data meets the minimum standard required for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data.

Results: SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Because a high-performance computing (HPC) framework gave an observed linearithmic speedup for the majority of tasks, poor-quality data was identified prior to secondary analysis in significantly less time on the HPC framework than when the same data was run using alternative parallelization strategies on a single server.

Conclusions: The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds of gigabytes of samples regardless of coverage or sample type.
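
As an illustration of the per-read technical checks described under Results, the following is a minimal sketch and not SAMQA's actual implementation. It applies a few rules taken from the SAM specification (eleven mandatory fields, FLAG, POS and MAPQ ranges, CIGAR syntax, and SEQ/QUAL/CIGAR length agreement) to a single alignment line. The function name `classify_read` and the error labels are hypothetical and chosen only for this example.

```python
import re

# CIGAR grammar from the SAM specification: "*" or one or more <length><op> pairs.
CIGAR_RE = re.compile(r"\*|([0-9]+[MIDNSHPX=])+")
CIGAR_OP_RE = re.compile(r"([0-9]+)([MIDNSHPX=])")
# Operations that consume bases of the read sequence.
QUERY_CONSUMING = set("MIS=X")


def in_range(text, upper):
    """True if text is a non-negative ASCII integer no larger than upper."""
    return text.isascii() and text.isdigit() and int(text) <= upper


def classify_read(line):
    """Return a list of (hypothetical) error labels for one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return ["MISSING_FIELDS"]
    flag, pos, mapq, cigar = fields[1], fields[3], fields[4], fields[5]
    seq, qual = fields[9], fields[10]

    errors = []
    if not in_range(flag, 2**16 - 1):
        errors.append("INVALID_FLAG")
    if not in_range(pos, 2**31 - 1):
        errors.append("INVALID_POS")
    if not in_range(mapq, 255):
        errors.append("INVALID_MAPQ")

    if not CIGAR_RE.fullmatch(cigar):
        errors.append("INVALID_CIGAR")
    elif cigar != "*" and seq != "*":
        # The query-consuming CIGAR operations must sum to the read length.
        query_len = sum(
            int(n) for n, op in CIGAR_OP_RE.findall(cigar) if op in QUERY_CONSUMING
        )
        if query_len != len(seq):
            errors.append("CIGAR_SEQ_LENGTH_MISMATCH")

    if seq != "*" and qual != "*" and len(seq) != len(qual):
        errors.append("SEQ_QUAL_LENGTH_MISMATCH")
    return errors
```

Because each line is checked independently of every other line, a check of this shape can be distributed across many workers, which fits the read-level classification the abstract describes. The biological standards defined by researchers are not specified in the abstract and are therefore not modeled here.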

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3170309
DOI: http://dx.doi.org/10.1186/1471-2164-12-419
