Article Abstract

Background: Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes.
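The reference-based encoding described above (alignment position plus differences from the reference) can be illustrated with a minimal sketch. It assumes the alignment position is already known, handles substitutions only (no insertions or deletions), and uses hypothetical function names (`encode_read`, `decode_read`) that do not come from any particular tool.

```python
def encode_read(read, reference, pos):
    """Encode a read as (position, list of (offset, base) substitutions).

    Assumes the read aligns to reference[pos:pos + len(read)] with
    substitutions only; a real aligner would supply `pos`.
    """
    diffs = [(i, base) for i, base in enumerate(read)
             if reference[pos + i] != base]
    return pos, diffs


def decode_read(encoded, reference, read_len):
    """Rebuild the read from the reference window and the stored substitutions."""
    pos, diffs = encoded
    bases = list(reference[pos:pos + read_len])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGC"
read = "GTACCT"  # matches reference[2:8] except at offset 4
encoded = encode_read(read, reference, pos=2)
assert decode_read(encoded, reference, len(read)) == read
```

A read that matches the reference exactly is stored as a position and an empty difference list, which is why this representation is compact when most reads differ from the reference at only a few bases.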

Results: We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters.
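The idea of encoding by hashing reads into a Bloom filter and decoding by querying it with read-length windows of the reference can be sketched as follows. This is an illustrative toy under stated assumptions, not the published implementation: all reads have the same length, reads are treated as a set (duplicates and order are not preserved), and an explicit false-positive set plus a list of reads absent from the reference stand in for the filter cascade mentioned above. The `BloomFilter` class and the `encode`/`decode` helpers are hypothetical names.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive one bit position per hash function via salted BLAKE2b.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=seed.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def windows(reference, read_len):
    """Yield every read-length subsequence of the reference."""
    for i in range(len(reference) - read_len + 1):
        yield reference[i:i + read_len]


def encode(reads, reference, read_len, num_bits=1 << 16, num_hashes=4):
    read_set = set(reads)
    bf = BloomFilter(num_bits, num_hashes)
    for read in read_set:
        bf.add(read)
    ref_windows = set(windows(reference, read_len))
    # Simplification: store false positives explicitly instead of in a cascade.
    false_positives = {w for w in ref_windows
                       if w in bf and w not in read_set}
    # Reads that never occur in the reference cannot be found by querying it.
    unmatched = read_set - ref_windows
    return bf, false_positives, unmatched


def decode(bf, false_positives, unmatched, reference, read_len):
    recovered = {w for w in windows(reference, read_len)
                 if w in bf and w not in false_positives}
    return recovered | unmatched


reference = "ACGTACGTTAGCCGATAGCTAGGCTA"
reads = ["ACGTAC", "TAGCCG", "GGCTAA"]  # last read does not occur in the reference
bf, fps, rest = encode(reads, reference, read_len=6)
assert decode(bf, fps, rest, reference, read_len=6) == set(reads)
```

No alignment is performed at any point: the decoder only needs the filter, the reference, and the read length, and the side information corrects the filter's false positives and supplies the reads that the reference cannot reproduce.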

Conclusions: Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. At high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while increasing space only slightly.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4168706
DOI: http://dx.doi.org/10.1186/1471-2105-15-S9-S7

Publication Analysis

Top Keywords

reference genome: 16
bloom filters: 12
reference-based methods: 12
lossless compression: 8
compression methods: 8
compression achieved: 8
methods compress: 8
better reference-free: 8
reference-free methods: 8
order magnitude: 8

Similar Publications

Hybrid strains of enterotoxigenic/Shiga toxin-producing Escherichia coli, United Kingdom, 2014-2023.

J Med Microbiol

January 2025

NIHR Health Protection Research Unit in Gastrointestinal Infections, University of Liverpool, Liverpool, UK.

Diarrhoeagenic Escherichia coli (DEC) pathotypes are defined by genes located on mobile genetic elements, and more than one definitive pathogenicity gene may be present in the same strain. In August 2022, UK Health Security Agency (UKHSA) surveillance systems detected an outbreak of hybrid Shiga toxin-producing E. coli/enterotoxigenic E. coli (STEC-ETEC) serotype O101:H33 harbouring genes encoding both Shiga toxin and heat-stable toxin. These hybrid strains of DEC are a public health concern, as they are often associated with enhanced pathogenicity.

Introduction: The Tat protein is a trans-activator of HIV-1 genome transcription, with additional functions including the ability to induce chronic inflammation. Natural amino acid polymorphisms in Tat may affect its functional properties and the course of HIV infection. The aim of this work is to analyze the features of Tat consensus sequences in non-A6 HIV-1 variants characteristic of the Russian Federation, and to study natural polymorphisms in Tat of CRF63_02A6 and subtype B variants circulating in Russia.

Haplotypes of Chloroquine Resistance Marker Genes Among Uncomplicated Malaria Cases in Lagos, Nigeria.

Biochem Genet

January 2025

Key Laboratory of Parasite and Vector Biology of the Chinese Ministry of Health, Chinese Center for Disease Control and Prevention, WHO Collaborating Centre for Tropical Diseases, National Institute of Parasitic Diseases, Shanghai, 200025, People's Republic of China.

Drug resistance resulting from mutations in Plasmodium falciparum, which has caused the failure of previously effective malaria drugs, continues to threaten the global malaria elimination goal. This study describes the profiles of P. falciparum chloroquine resistance transporter (Pfcrt) and P.

Dengue is one of the most prevalent viruses transmitted by Aedes aegypti mosquitoes. Currently, no specific medication is available to treat dengue disease. The NS2B-NS3 protease, which is vital during post-translational processing, is the key target in this study.

In July 2022, a genetically linked and geographically dispersed cluster of 12 cases of Shiga toxin-producing Escherichia coli (STEC) O103:H2 was detected by the UK Health Security Agency using whole genome sequencing. Review of food history questionnaires identified cheese (particularly an unpasteurized brie-style cheese) and mixed salad leaves as potential vehicles. A case-control study was conducted to investigate exposure to these products.
