SpaRC: scalable sequence clustering using Apache Spark.

Bioinformatics

US Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA.

Published: March 2019

Motivation: Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100-1000 GB of sequence data derived from tens of thousands of different genes or microbial species. Assembling these data sets requires trade-offs between scalability and accuracy: current assembly methods optimized for scalability often sacrifice accuracy, and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes.

Results: Here we describe SparkReadClust (SpaRC), a scalable sequence clustering application built on Apache Spark that partitions reads by their molecule of origin so that downstream assembly can be optimized. SpaRC achieves high clustering performance on transcriptomes and metagenomes from both short-read and long-read sequencing technologies. It scales near-linearly with input data size and number of compute nodes. SpaRC runs on both cloud computing and HPC environments without modification while delivering similar performance. Our results demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems.
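The abstract does not spell out how reads are partitioned, so the following is only a minimal illustrative sketch, not SpaRC's implementation. It assumes that reads from the same molecule tend to share k-mers, links any two reads that share at least one k-mer, and reports connected components as clusters using a union-find structure. The function names (`kmers`, `cluster_reads`), the value of `k`, and the single-shared-k-mer rule are all hypothetical choices made for illustration; reverse complements and sequencing errors are ignored, and everything runs in memory rather than on Spark.

```python
from collections import defaultdict


def kmers(read, k):
    """Return the set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def cluster_reads(reads, k=5):
    """Group read indices into clusters of reads connected by shared k-mers.

    Assumption (not from the paper): one shared k-mer is enough to link
    two reads, and clusters are the connected components of that graph.
    """
    parent = list(range(len(reads)))

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller index as the root for deterministic output.
            parent[max(ra, rb)] = min(ra, rb)

    # Inverted index: k-mer -> indices of the reads that contain it.
    index = defaultdict(list)
    for i, read in enumerate(reads):
        for km in kmers(read, k):
            index[km].append(i)

    # Every read containing a given k-mer joins the same component.
    for ids in index.values():
        for other in ids[1:]:
            union(ids[0], other)

    clusters = defaultdict(list)
    for i in range(len(reads)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy_reads = ["ACGTACGTAA", "GTACGTAACC", "TTTTGGGGCC", "TGGGGCCAAA"]
    print(cluster_reads(toy_reads, k=5))
```

On the toy input, the first two reads overlap each other and the last two overlap each other, so the sketch places them in two separate clusters. In a distributed setting the inverted index and the component search would each become a grouped operation over partitions, which is the kind of workload Spark is designed for.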

Availability And Implementation: https://bitbucket.org/berkeleylab/jgi-sparc.

DOI: http://dx.doi.org/10.1093/bioinformatics/bty733
