With the increasing throughput of modern sequencing instruments, the cost of storing and transmitting sequencing data has also increased dramatically. Although many tools have been developed to compress sequencing data, there is still a need for a compressor with a higher compression ratio. In this paper, we present a two-step framework for compressing sequencing data. The first step repacks the original data into a binary stream, and the second step compresses that stream with an LZMA encoder. We develop a new strategy for encoding the original file into a stream that is highly compressible by LZMA. In addition, an FPGA-accelerated implementation of LZMA was developed to speed up the second step. As a demonstration, we present repaq, a lossless, non-reference compressor for FASTQ format files. We introduce a multifile redundancy elimination method, which is very useful for compressing paired-end sequencing data. According to our test results, the compression ratio of repaq is much higher than that of other FASTQ compressors. For some deep sequencing data, the compression ratio of repaq can exceed 25, almost four times that of Gzip. The framework presented in this paper can also be applied to develop new tools for compressing other sequencing data. The open-source code of repaq is available at: https://github.com/OpenGene/repaq.
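The two-step idea (repack, then LZMA) can be illustrated with a minimal Python sketch. This is not repaq's actual encoding, which the abstract does not specify: the 2-bit base packing, the 8-byte length header, and the function names `pack_bases`, `unpack_bases`, `compress_reads`, and `decompress_reads` are all assumptions made for illustration. The sketch handles only A/C/G/T and ignores read names, quality scores, and N bases, all of which a real FASTQ compressor must handle.

```python
import lzma

# Hypothetical 2-bit code for the four bases (an assumption, not repaq's format).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack_bases(seq: str) -> bytes:
    """Step 1 (illustrative): pack A/C/G/T into 2 bits each, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpacking reads from the high bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(packed: bytes, n_bases: int) -> str:
    """Inverse of pack_bases; n_bases trims the padding in the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 3])
    return "".join(bases[:n_bases])


def compress_reads(seq: str) -> bytes:
    """Step 2 (illustrative): LZMA-compress the repacked binary stream."""
    header = len(seq).to_bytes(8, "big")  # base count, needed to undo padding
    return lzma.compress(header + pack_bases(seq), preset=9)


def decompress_reads(blob: bytes) -> str:
    """Reverse both steps and recover the original sequence losslessly."""
    raw = lzma.decompress(blob)
    n_bases = int.from_bytes(raw[:8], "big")
    return unpack_bases(raw[8:], n_bases)
```

Calling `decompress_reads(compress_reads(seq))` returns `seq` unchanged for any string over A/C/G/T, which is the lossless property the abstract claims for repaq. The compression ratios reported above depend on repaq's own repacking strategy and cannot be inferred from this sketch.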

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10552150
DOI: http://dx.doi.org/10.3389/fgene.2023.1260531
