Critical assessment of on-premise approaches to scalable genome analysis.

BMC Bioinformatics

Department of Electrical Engineering and Computer Science, College of Engineering, Khalifa University, P.O. Box 127788, Abu Dhabi, United Arab Emirates.

Published: September 2023

AI Article Synopsis

  • Advances in DNA sequencing have drastically reduced costs and increased the scale and complexity of genomics research, enhancing the ability to predict disease traits from genetic data.
  • A comparison of genomic data science tools (BCFtools, SnpSift, Hail, GEMINI, OpenCGA) was conducted for on-premise installations, where data confidentiality and compliance rules exclude cloud solutions; criteria included storage technology, query speed, and scalability.
  • Tools that use advanced data structures are better suited for large-scale genomics projects, while simpler tools can be effective for smaller ones, informing the development of scalable infrastructure in the field.

Article Abstract

Background: Plummeting DNA sequencing costs in recent years have enabled genome sequencing projects to scale up by several orders of magnitude, transforming genomics into a highly data-intensive field of research. This development provides the much-needed statistical power for genotype-phenotype predictions in complex diseases.

Methods: To efficiently leverage this wealth of information, we assessed several genomic data science tools. We focused on on-premise installations to cover situations where data confidentiality and compliance regulations rule out cloud-based solutions. We established a comprehensive qualitative and quantitative comparison between BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability.

Results: Tools that leverage sophisticated data structures are noted as the most suitable for large-scale projects, with varying degrees of scalability, in comparison to flat-file manipulation tools (e.g., BCFtools and SnpSift). Remarkably, for small to mid-size projects, even lightweight relational database.

Conclusion: The assessment criteria provide insights into the typical questions posed in scalable genomics and serve as guidance for the development of scalable computational infrastructure in genomics.
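
As a rough illustration of the contrast the Results draw between flat-file manipulation and database-backed storage, the following minimal Python sketch answers the same region query twice: once by scanning tab-separated lines, and once through an indexed SQLite table. Everything in it is an assumption made for illustration: the toy variants, the table layout, and the function names are hypothetical and are not taken from the study or from any of the assessed tools. Real flat-file tools such as BCFtools can also use index files for region queries, so the full scan here is a deliberate simplification.

```python
import sqlite3

# Hypothetical toy variants (chrom, pos, ref, alt); not data from the study.
VARIANTS = [
    ("1", 1000, "A", "G"),
    ("1", 2500, "C", "T"),
    ("2", 1500, "G", "A"),
    ("2", 9000, "T", "C"),
]


def to_flat_lines(variants):
    """Render variants as tab-separated lines, loosely mimicking a flat file."""
    return [f"{c}\t{p}\t{r}\t{a}" for c, p, r, a in variants]


def flat_file_query(lines, chrom, start, end):
    """Region query by scanning every line; cost grows with file size."""
    hits = []
    for line in lines:
        c, p, r, a = line.split("\t")
        if c == chrom and start <= int(p) <= end:
            hits.append((c, int(p), r, a))
    return hits


def build_db(variants):
    """Load variants into an in-memory SQLite table with a region index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?)", variants)
    # The composite index lets SQLite avoid a full table scan for region queries.
    conn.execute("CREATE INDEX idx_region ON variants (chrom, pos)")
    conn.commit()
    return conn


def db_query(conn, chrom, start, end):
    """Same region query, answered through the index."""
    cur = conn.execute(
        "SELECT chrom, pos, ref, alt FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (chrom, start, end),
    )
    return cur.fetchall()


if __name__ == "__main__":
    lines = to_flat_lines(VARIANTS)
    conn = build_db(VARIANTS)
    print(flat_file_query(lines, "2", 1000, 5000))
    print(db_query(conn, "2", 1000, 5000))
```

Both functions return the same rows for the same region; what differs is how the work scales as the number of variants grows, which is the kind of trade-off the assessment criteria (data storage technology, query speed, scalability) are meant to capture.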


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10512525
DOI: http://dx.doi.org/10.1186/s12859-023-05470-2
