Background: De novo genome assembly is a technique that reconstructs the genome of a specimen from the overlaps among genomic fragments, without relying on a reference sequence. The sequenced fragments (called reads) are assembled into contigs and scaffolds based on those overlaps. The quality of a de novo assembly depends on the length and contiguity of the assembled sequences. To enable faster and more accurate assembly, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been developed. However, these techniques require large amounts of computer memory when very large overlap graphs must be resolved, and the resolution step is difficult to parallelize.
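
As a minimal illustration of the overlap computation that underlies such graphs (this is not SORA's implementation; the reads, the minimum-overlap threshold, and the function name are invented for this sketch), the following Scala snippet finds the longest suffix-prefix overlap between two reads. Real assemblers use indexes such as suffix arrays or the FM-index rather than this quadratic scan.

```scala
// Sketch: longest suffix of read `a` that matches a prefix of read `b`.
// `minLen` is an assumed cutoff below which overlaps are ignored.
def longestOverlap(a: String, b: String, minLen: Int = 3): Int = {
  // Try the longest possible suffix of `a` first and shrink until a
  // prefix of `b` matches or we fall below the minimum length.
  var len = math.min(a.length, b.length)
  while (len >= minLen) {
    if (b.startsWith(a.substring(a.length - len))) return len
    len -= 1
  }
  0
}

// "GATTACA" and "TACAGGC" share the overlap "TACA" (length 4), so an
// overlap-graph edge would point from the first read to the second.
println(longestOverlap("GATTACA", "TACAGGC")) // prints 4
```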

Results: To address these limitations, we propose a novel algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction using Apache Spark. Its implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently compacts the edges along enormous graph paths by exploiting the scalable graph processing libraries built on Apache Spark, GraphX and GraphFrames.
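
As a hedged sketch of the kind of operation such a reduction performs (this is not SORA's actual code; the toy edge list and the compaction criterion are assumptions for illustration), the following Scala snippet uses Spark GraphX to flag vertices that lie on unambiguous linear paths, i.e. the vertices a path-compaction pass would merge into a single edge.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, Graph}

object PathCompactionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("path-compaction-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Toy overlap graph: 1 -> 2 -> 3 -> 4 is a linear chain;
    // vertex 4 branches to 5 and 6, so the chain ends there.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, ()), Edge(2L, 3L, ()), Edge(3L, 4L, ()),
      Edge(4L, 5L, ()), Edge(4L, 6L, ())))
    val graph = Graph.fromEdges(edges, defaultValue = ())

    // Join in- and out-degrees onto the vertices; a vertex is compactable
    // when both are exactly 1 (it neither starts, ends, nor branches a path).
    val degrees = graph.inDegrees.join(graph.outDegrees)
    val compactable = degrees.filter { case (_, (in, out)) => in == 1 && out == 1 }

    compactable.keys.collect().sorted.foreach(println) // prints 2, 3
    spark.stop()
  }
}
```

A full reduction would then walk each flagged chain and replace it with one edge carrying the merged sequence, which is what shrinks the edge count on large graphs.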

Conclusions: The algorithms and experimental results are available at our project website, https://github.com/BioHPC/SORA . We evaluated SORA with human genome samples. First, it processed a graph with nearly one billion edges on a distributed cloud cluster. Second, it processed small- to mid-sized graphs on a single workstation within a short time frame. Overall, SORA scaled nearly linearly as the number of computing instances increased.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6805285
DOI: http://dx.doi.org/10.1186/s40246-019-0227-1

Publication Analysis

Top Keywords

apache spark (12)
genome assembly (12)
scalable overlap-graph (8)
overlap-graph reduction (8)
novo genome (8)
sequencing techniques (8)
reduction algorithms (8)
assembly (6)
genome (5)
sora (5)

Similar Publications

SparkDWM: a scalable design of a Data Washing Machine using Apache Spark.

Front Big Data

September 2024

Department of Information Sciences, University of Arkansas at Little Rock, Little Rock, AR, United States.

Data volume has been one of the fastest-growing assets of most real-world applications. This growth increases the rate of human errors such as duplicated records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution is an ETL process that aims to resolve such inconsistencies by determining which records refer to the same real-world objects.
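
As a minimal, hypothetical illustration of the entity-resolution idea described here (not SparkDWM's actual pipeline; the column names and blocking key are invented), the following Spark snippet groups candidate duplicate records under a normalized key.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, lower, regexp_replace}

object EntityResolutionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("er-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy records with a spelling/spacing variant of the same person.
    val records = Seq((1, "John  Smith"), (2, "john smith"), (3, "Jane Doe"))
      .toDF("id", "name")

    // Blocking key: lowercase and strip non-letters, so trivial variants
    // of the same real-world entity collide in one candidate group.
    val keyed = records.withColumn(
      "key", regexp_replace(lower(col("name")), "[^a-z]", ""))
    keyed.groupBy("key").agg(collect_list("id").as("candidate_ids")).show()
    // Key "johnsmith" collects ids 1 and 2 as a candidate duplicate pair.

    spark.stop()
  }
}
```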

Elevating Smart Manufacturing with a Unified Predictive Maintenance Platform: The Synergy between Data Warehousing, Apache Spark, and Machine Learning.

Sensors (Basel)

June 2024

Department of Industrial Engineering and Engineering Management, Yuan Ze University, 135, Far-East Rd., Taoyuan 320315, Taiwan.

The transition to smart manufacturing introduces heightened complexity in regard to the machinery and equipment used within modern collaborative manufacturing landscapes, presenting significant risks associated with equipment failures. The core ambition of smart manufacturing is to elevate automation through the integration of state-of-the-art technologies, including artificial intelligence (AI), the Internet of Things (IoT), machine-to-machine (M2M) communication, cloud technology, and expansive big data analytics. This technological evolution underscores the necessity for advanced predictive maintenance strategies that proactively detect equipment anomalies before they escalate into costly downtime.

With the advance of smart manufacturing and information technologies, the volume of data to process is increasing accordingly. Current solutions for big data processing resort to distributed stream processing systems, such as Apache Flink and Spark. However, such frameworks face challenges of resource underutilization and high latency in big data application scenarios.

A distributed data processing scheme based on Hadoop for synchrotron radiation experiments.

J Synchrotron Radiat

May 2024

The Institute for Advanced Studies, Wuhan University, Wuhan 430072, People's Republic of China.

With the development of synchrotron radiation sources and high-frame-rate detectors, the amount of experimental data collected at synchrotron radiation beamlines has increased exponentially. As a result, data processing for synchrotron radiation experiments has entered the era of big data. It is becoming increasingly important for beamlines to have the capability to process large-scale data in parallel to keep up with the rapid growth of data.

An Optimized IoT-enabled Big Data Analytics Architecture for Edge-Cloud Computing.

IEEE Internet Things J

March 2023

Department of Information Science, College of Computer and Information Systems, Umm Al-Qura University, Makkah, Saudi Arabia.

Edge computing has gained prominence with the rise of the Internet of Things (IoT). Edge-enabled solutions offer efficient computing and control at the network edge to resolve scalability and latency concerns. However, it is challenging for edge computing to handle the diverse applications of IoT, as they produce massive volumes of heterogeneous data.
