Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing.

BMC Genomics

Systems Pharmacology and Biomarkers, Janssen Research & Development, LLC, 3210 Merryfield Row, San Diego, CA 92121, USA.

Published: June 2013

AI Article Synopsis

  • Technical advancements have made sequencing cheaper, allowing large datasets to be generated, but existing tools like Crossbow struggle with large-scale analyses and require significant resources.
  • Rainbow, a cloud-based software package developed to automate large-scale whole-genome sequencing analyses, can process data from over 500 subjects in two weeks for under $120 per sample using Amazon Web Services.
  • Rainbow improves upon Crossbow by supporting various input file types, optimizing data load management, logging processing metrics, and merging outputs for easier genome-wide association studies.
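The merging of per-subject outputs mentioned in the last point can be illustrated with a minimal Python sketch. This is not Rainbow's code: the function name `merge_calls`, the simplified (chromosome, position) to genotype mapping, and the `NA` placeholder for missing calls are all assumptions made for illustration, and the real SOAPsnp output has a richer column layout that is not reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical simplified representation: one dict per subject,
# keyed by (chromosome, position) and holding a genotype string.
Site = Tuple[str, int]


def merge_calls(per_subject: Dict[str, Dict[Site, str]],
                missing: str = "NA") -> List[List[str]]:
    """Merge per-subject SNP calls into one site-by-subject table.

    The first row is a header; every following row holds one site
    with one genotype column per subject. Sites a subject lacks are
    filled with the `missing` placeholder.
    """
    subjects = sorted(per_subject)
    # Union of all sites seen in any subject. Chromosome names sort
    # as plain strings here, so "chr10" would come before "chr2".
    sites = sorted({site for calls in per_subject.values() for site in calls})
    table = [["chrom", "pos"] + subjects]
    for chrom, pos in sites:
        row = [chrom, str(pos)]
        row += [per_subject[s].get((chrom, pos), missing) for s in subjects]
        table.append(row)
    return table


# Made-up example input for two subjects.
calls = {
    "subjA": {("chr1", 100): "AG", ("chr2", 50): "TT"},
    "subjB": {("chr1", 100): "AA", ("chr1", 250): "CT"},
}
merged = merge_calls(calls)
for row in merged:
    print("\t".join(row))
```

A single site-by-subject table of this shape is the kind of input that association-testing tools typically expect, which is presumably why a merged file eases downstream genome-wide association studies.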

Article Abstract

Background: Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU resources required for large-scale WGS data analyses are too large for many core facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, a cloud-based software package that can assist in the automation of large-scale WGS data analyses.

Results: Here, we evaluated the performance of Rainbow by analyzing 44 different whole-genome-sequenced subjects. Rainbow has the capacity to process genomic data from more than 500 subjects in two weeks using cloud computing provided by Amazon Web Services. This time includes the import and export of the data using the Amazon Import/Export service. The average cost of processing a single sample in the cloud was less than 120 US dollars. Compared with Crossbow, the main improvements incorporated into Rainbow include the ability: (1) to handle BAM as well as FASTQ input files; (2) to split large sequence files for better load balance downstream; (3) to log running metrics during data processing and to monitor multiple Amazon Elastic Compute Cloud (EC2) instances; and (4) to merge SOAPsnp outputs for multiple individuals into a single file to facilitate downstream genome-wide association studies.
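Improvement (2), splitting large sequence files so that downstream workers receive evenly sized pieces, can be sketched in Python as follows. This is an illustration under stated assumptions, not Rainbow's implementation: the function name `split_fastq` and the `reads_per_chunk` parameter are hypothetical, and the sketch assumes the common four-line FASTQ record layout (header, sequence, separator, qualities) with no wrapped sequence lines.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_fastq(lines: Iterable[str],
                reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines holding at most `reads_per_chunk` reads.

    Assumes every record occupies exactly four lines, so chunk
    boundaries always fall between complete records.
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    it = iter(lines)
    while True:
        # Pull the next block of whole records (4 lines per read).
        chunk = list(islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        if len(chunk) % 4 != 0:
            raise ValueError("input ends with a truncated FASTQ record")
        yield chunk


# Made-up input: five short reads.
demo_lines: List[str] = []
for i in range(5):
    demo_lines += [f"@read{i}\n", "ACGT\n", "+\n", "IIII\n"]

chunks = list(split_fastq(demo_lines, reads_per_chunk=2))
print([len(c) // 4 for c in chunks])  # [2, 2, 1]
```

Because the function accepts any iterable of lines, an open file handle can be passed directly and each yielded chunk written to its own file, so the whole input never has to sit in memory at once.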

Conclusions: Rainbow is a scalable, cost-effective, and open-source tool for large-scale WGS data analysis. For human WGS data sequenced on either the Illumina HiSeq 2000 or HiSeq 2500 platform, Rainbow can be used straight out of the box. Rainbow is available for third-party implementation and use, and can be downloaded from http://s3.amazonaws.com/jnj_rainbow/index.html.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3698007
DOI: http://dx.doi.org/10.1186/1471-2164-14-425

Publication Analysis

Top Keywords

wgs data: 16
large-scale wgs: 12
data: 11
rainbow: 8
tool large-scale: 8
whole-genome sequencing: 8
sequencing data: 8
data analysis: 8
cloud computing: 8
large-scale: 5

Similar Publications

Introduction: Recent epidemiological data suggests a rising incidence of breast angiosarcoma (AS-B) in the Western population, with over two-thirds related to irradiation or chronic lymphedema. However, unlike head and neck angiosarcoma (AS-HN), AS-B disease characteristics in Asia remain unclear.

Methods: We examined clinical patterns of angiosarcoma patients (n = 176) seen in an Asian tertiary cancer center from 1999 to 2021, and specifically investigated the molecular and immune features of AS-B in comparison to AS-HN.

Background: Mixed infection with multiple strains of the same pathogen in a single host can present clinical and analytical challenges. Whole genome sequence (WGS) data can identify signals of multiple strains in samples, though the precision of previous methods can be improved. Here, we present MixInfect2, a new tool to accurately detect mixed samples from Mycobacterium tuberculosis short-read WGS data.

Significance of KLK7 expression, polymorphisms, and function in sheep horn growth.

BMC Genomics

January 2025

State Key Laboratory of Animal Biotech Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences (CAAS), Beijing, 100193, China.

Background: Sheep horns play a critical role in the survival and reproduction of sheep. Research on sheep horns not only aids in comprehending their biological roles but is also vital for developing hornless breeds. Although previous studies have suggested that KLK7 may be associated with keratin growth, there are few studies that have focused on the role of KLK7 in sheep horns.

Introduction: The methicillin-resistant Staphylococcus aureus (MRSA) genome varies by geographical location. This study aims to determine the genomic characteristics of MRSA using whole-genome sequencing (WGS) data from medical centers in Mexico and to explore the associations between antimicrobial resistance genes and virulence factors.

Methods: This study included 27 clinical isolates collected from sterile sites at eight centers in Mexico in 2022 and 2023.

Evaluation of nationwide analysis surveillance for methicillin-resistant Staphylococcus aureus within Genomic Medicine Sweden.

Microb Genom

January 2025

Department of Laboratory Medicine, Clinical Microbiology, Faculty of Medicine and Health, Örebro University, Örebro, Sweden.

National epidemiological investigations of microbial infections greatly benefit from the increased information gained by whole-genome sequencing (WGS) in combination with standardized approaches for data sharing and analysis. The aim was to evaluate the quality and accuracy of WGS data generated by different laboratories but analysed by joint pipelines to reach a national surveillance approach. A national methicillin-resistant Staphylococcus aureus (MRSA) collection of 20 strains was distributed to nine participating laboratories that performed in-house procedures for WGS.
