FRESCO: Referential compression of highly similar sequences.

IEEE/ACM Trans Comput Biol Bioinform

Published: August 2014

In many applications, sets of similar texts or sequences are of high importance. Prominent examples are revision histories of documents or genomic sequences. Modern high-throughput sequencing technologies are able to generate DNA sequences at an ever-increasing rate. In parallel to the decreasing experimental time and cost necessary to produce DNA sequences, computational requirements for analysis and storage of the sequences are steeply increasing. Compression is a key technology to deal with this challenge. Recently, referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have gained a lot of interest in this field. In this paper, we propose a general open-source framework to compress large amounts of biological sequence data called Framework for REferential Sequence COmpression (FRESCO). Our basic compression algorithm is shown to be one to two orders of magnitude faster than comparable related work, while achieving similar compression ratios. We also propose several techniques to further increase compression ratios, while still retaining the advantage in speed: 1) selecting a good reference sequence; and 2) rewriting a reference sequence to allow for better compression. In addition, we propose a new way of further boosting the compression ratios by applying referential compression to already referentially compressed files (second-order compression). This technique allows for compression ratios far beyond the state of the art, for instance, 4,000:1 and higher for human genomes. We evaluate our algorithms on a large data set from three different species (more than 1,000 genomes, more than 3 TB) and on a collection of versions of Wikipedia pages. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
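The idea of referential compression described above can be illustrated with a minimal Python sketch. This is not FRESCO's implementation: the k-mer index, the greedy longest-match strategy, the entry format (`("M", position, length)` for a match, `("L", text)` for a literal), and the function names `build_index`, `compress`, and `decompress` are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def compress(target, reference, k=4):
    """Encode target as matches against reference plus literal leftovers.

    Hypothetical entry format (not taken from the paper):
      ("M", ref_pos, length)  copy `length` characters from the reference
      ("L", text)             raw characters not found in the reference
    """
    index = build_index(reference, k)
    entries, literal = [], []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position that shares the next k-mer.
        for pos in index.get(target[i:i + k], ()):
            length = k
            # Extend the match greedily while both sequences agree.
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:  # flush pending literal characters first
                entries.append(("L", "".join(literal)))
                literal = []
            entries.append(("M", best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])
            i += 1
    if literal:
        entries.append(("L", "".join(literal)))
    return entries


def decompress(entries, reference):
    """Rebuild the original sequence from the entries and the reference."""
    parts = []
    for entry in entries:
        if entry[0] == "M":
            _, pos, length = entry
            parts.append(reference[pos:pos + length])
        else:
            parts.append(entry[1])
    return "".join(parts)


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"
    target = "ACGTACGATTGACCA"  # differs from the reference in one base
    entries = compress(target, reference, k=4)
    print(entries)  # [('M', 0, 7), ('L', 'A'), ('M', 8, 7)]
    assert decompress(entries, reference) == target
```

In this toy example a single substitution splits the target into two long matches and one literal character, which is why highly similar sequences compress so well against a good reference. A practical system would additionally encode the entries compactly and use a far more scalable index than a Python dictionary.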


Source
http://dx.doi.org/10.1109/tcbb.2013.122


