Clustering-Based Compression for Population DNA Sequences.

IEEE/ACM Trans Comput Biol Bioinform

Published: August 2019

Advances in DNA sequencing have driven exponential growth in the number of sequenced individual genomes, so effective compression of these sequences is highly desirable. In this work, we present a novel compression algorithm called Reference-based Compression algorithm using the concept of Clustering (RCC). RCC is motivated by the observation that population sequences contain substructures. To exploit these substructures, k-means clustering is used to partition the sequences into clusters. A reference sequence is then constructed for each cluster, and the sequences in that cluster are compressed by referring to it. The reference sequence of each cluster is in turn compressed with reference to a sequence derived from all the reference sequences. Experiments show that RCC can further reduce the compressed size by up to 91.0 percent compared with state-of-the-art compression approaches. This gain comes at the cost of processing time: the current Matlab implementation is slower than existing algorithms implemented in C/C++ by a factor of thousands. Further investigation is required to improve processing time.
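
The two-level scheme described in the abstract (cluster the sequences, build one reference per cluster, then encode the cluster references against a sequence derived from all of them) can be illustrated with a small Python sketch. This is not the authors' Matlab implementation, and the abstract does not say how references are constructed or how differences are encoded. The sketch assumes equal-length, pre-aligned sequences, substitutes a k-modes-style clustering (Hamming distance with per-position majority consensus) for k-means, and stores plain (position, base) substitutions. All function names are hypothetical.

```python
from collections import Counter


def consensus(seqs):
    """Per-position majority base over equal-length sequences (assumed reference rule)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def diff(seq, ref):
    """Substitutions needed to turn ref into seq, as (position, base) pairs."""
    return [(i, s) for i, (s, r) in enumerate(zip(seq, ref)) if s != r]


def patch(ref, edits):
    """Apply (position, base) substitutions to ref."""
    out = list(ref)
    for i, base in edits:
        out[i] = base
    return "".join(out)


def cluster(seqs, k, iters=10):
    """k-modes-style stand-in for k-means: Hamming distance, consensus centroids."""
    centroids = list(seqs[:k])  # naive initialisation: the first k sequences
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: hamming(s, centroids[c])) for s in seqs
        ]
        if new_labels == labels:  # assignments stable, so stop early
            break
        labels = new_labels
        for c in range(k):
            members = [s for s, lab in zip(seqs, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = consensus(members)
    return labels, centroids


def encode(seqs, k):
    """Level 1: sequences vs. their cluster reference. Level 2: references vs. a master."""
    labels, refs = cluster(seqs, k)
    master = consensus(refs)  # sequence derived from all cluster references
    return {
        "master": master,
        "refs": [diff(r, master) for r in refs],
        "seqs": [(lab, diff(s, refs[lab])) for s, lab in zip(seqs, labels)],
    }


def decode(enc):
    """Rebuild the cluster references, then every original sequence."""
    refs = [patch(enc["master"], d) for d in enc["refs"]]
    return [patch(refs[lab], d) for lab, d in enc["seqs"]]


seqs = ["ACGTACGT", "TTGTACCC", "ACGTACGA", "TTGTACCC", "ACGTACGT", "TTGAACCC"]
enc = encode(seqs, k=2)
assert decode(enc) == seqs  # the encoding is lossless
```

In this toy input, four of the six sequences equal their cluster reference and need no stored substitutions, which is the effect the clustering step is meant to produce. A practical compressor would also have to handle insertions and deletions and would entropy-code the difference lists, neither of which is attempted here.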

Source
http://dx.doi.org/10.1109/TCBB.2017.2762302
