SeQual-Stream: approaching stream processing to quality control of NGS datasets.

BMC Bioinformatics

Universidade da Coruña, CITIC, Computer Architecture Group, Campus de Elviña, 15071, A Coruña, Spain.

Published: October 2023

Background: Quality control of DNA sequences is an important data preprocessing step in many genomic analyses. However, all existing parallel tools for this purpose are based on a batch processing model, which requires the complete genetic dataset to be available before processing can begin. This limitation hinders quality control performance in scenarios where the dataset must first be downloaded from a remote repository and/or copied to a distributed file system for parallel processing.

Results: In this paper we present SeQual-Stream, a streaming tool that performs multiple quality control operations on genomic datasets in a fast, distributed and scalable way. To do so, our approach relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets as they are being downloaded and/or copied to HDFS. The experimental results show significantly shorter execution times for SeQual-Stream than for a batch processing tool with similar quality control features, with a maximum speedup of 2.7× when processing a dataset with more than 250 million DNA sequences, while also demonstrating good scalability.
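The stream paradigm described above can be illustrated with a minimal sketch in plain Python. It is not SeQual-Stream's implementation and does not use Apache Spark or HDFS; it only shows the general idea of filtering sequences incrementally, as lines arrive, instead of waiting for the whole file. The FASTQ four-line record layout and Phred+33 quality encoding are assumed, and every name (`FastqRecord`, `parse_fastq`, `mean_quality`, `quality_filter`) and threshold is hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    """One sequence in FASTQ form (hypothetical helper type)."""
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield a record as soon as its four lines have arrived.

    `lines` can be any iterable, e.g. a file that is still being written
    or a network stream; nothing here needs the complete dataset.
    A trailing incomplete record is silently dropped.
    """
    buffer = []
    for line in lines:
        buffer.append(line.rstrip("\n"))
        if len(buffer) == 4:
            name, sequence, _separator, quality = buffer
            yield FastqRecord(name, sequence, quality)
            buffer = []


def mean_quality(record: FastqRecord, offset: int = 33) -> float:
    """Mean Phred score, assuming Phred+33 encoding."""
    if not record.quality:
        return 0.0
    return sum(ord(c) - offset for c in record.quality) / len(record.quality)


def quality_filter(
    records: Iterable[FastqRecord],
    min_length: int = 4,              # illustrative threshold
    min_mean_quality: float = 20.0,   # illustrative threshold
) -> Iterator[FastqRecord]:
    """Lazily keep records that are long enough and of sufficient quality."""
    for record in records:
        if len(record.sequence) < min_length:
            continue
        if mean_quality(record) >= min_mean_quality:
            yield record


# Toy input: r1 passes, r2 has low quality, r3 is too short.
example_lines = [
    "@r1", "ACGTACGT", "+", "IIIIIIII",
    "@r2", "ACGTACGT", "+", "########",
    "@r3", "AC", "+", "II",
]

kept = [r.name for r in quality_filter(parse_fastq(example_lines))]
print(kept)
```

Because both stages are generators, each record is filtered as soon as its four lines are available. A distributed streaming framework applies the same principle across many nodes and files.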

Conclusion: Our solution provides a more scalable and higher-performance way to carry out quality control of large genomic datasets by taking advantage of stream processing features. The tool is distributed as free open-source software released under the GNU AGPLv3 license and is publicly available to download at https://github.com/UDC-GAC/SeQual-Stream.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10612204
DOI: http://dx.doi.org/10.1186/s12859-023-05530-7
