We address the problem of discovering pairs of symmetric genomic words (i.e., words and the corresponding reversed complements) occurring at distances that are overrepresented.
View Article and Find Full Text PDFBaroreceptor reflex sensitivity (BRS) is an important prognostic factor because a reduced BRS has been associated with an adverse cardiovascular outcome. The threshold for a 'reduced' BRS was established by the ATRAMI study at BRS <3 ms/mmHg in patients with a previous myocardial infarction, and has been shown to improve risk assessment in many other cardiac dysfunctions. The successful application of this cut-off to other populations suggests that it may reflect an inherent property of baroreflex functioning, so our goal is to investigate whether it represents a 'natural' partition of BRS values.
View Article and Find Full Text PDFSpecies evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences.
View Article and Find Full Text PDFMotivation: Ebola virus causes high mortality hemorrhagic fevers, with more than 25 000 cases and 10 000 deaths in the current outbreak. Only experimental therapies are available, thus, novel diagnosis tools and druggable targets are needed.
Results: Analysis of Ebola virus genomes from the current outbreak reveals the presence of short DNA sequences that appear nowhere in the human genome.
Data summarization and triage is one of the current top challenges in visual analytics. The goal is to let users visually inspect large data sets and examine or request data with particular characteristics. The need for summarization and visual analytics is also felt when dealing with digital representations of DNA sequences.
View Article and Find Full Text PDFPrevious studies have suggested that Chargaff's second rule may hold for relatively long words (above 10nucleotides), but this has not been conclusively shown. In particular, the following questions remain open: Is the phenomenon of symmetry statistically significant? If so, what is the word length above which significance is lost? Can deviations in symmetry due to the finite size of the data be identified? This work addresses these questions by studying word symmetries in the human genome, chromosomes and transcriptome. To rule out finite-length effects, the results are compared with those obtained from random control sequences built to satisfy Chargaff's second parity rule.
View Article and Find Full Text PDFWe study the inter-dinucleotide distance distributions in the human genome, both in the whole-genome and protein-coding regions. The inter-dinucleotide distance is defined as the distance to the next occurrence of the same dinucleotide. We consider the 16 sequences of inter-dinucleotide distances and two reading frames.
View Article and Find Full Text PDFA finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders.
View Article and Find Full Text PDFMinimal absent words have been computed in genomes of organisms from all domains of life. Here, we explore different sets of minimal absent words in the genomes of 22 organisms (one archaeota, thirteen bacteria and eight eukaryotes). We investigate if the mutational biases that may explain the deficit of the shortest absent words in vertebrates are also pervasive in other absent words, namely in minimal absent words, as well as to other organisms.
View Article and Find Full Text PDFDNA may be represented by sequences of four symbols, but it is often useful to convert those symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but most of them seem to be unrelated to any intrinsic characteristic of DNA. The objective of this work was to study a mapping scheme that is directly related to DNA characteristics, and that could be useful in discriminating between different species.
View Article and Find Full Text PDFMotivation: DNA sequences can be represented by sequences of four symbols, but it is often useful to convert the symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but they seem unrelated to any intrinsic characteristic of DNA. The objective of this work was to find a mapping scheme directly related to DNA characteristics and that would be useful in discriminating between different species.
View Article and Find Full Text PDFBackground: The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones.
Results: We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works.
It is known that the protein-coding regions of DNA are usually characterized by a three-base periodicity. In this paper, we exploit this property, studying a DNA model based on three deterministic states, where each state implements a finite-context model. The experimental results obtained confirm the appropriateness of the proposed approach, showing compression gains in relation to the single finite-context model counterpart.
View Article and Find Full Text PDFPhys Rev E Stat Nonlin Soft Matter Phys
September 2004
This paper explores the connection between the size of the spectral coefficients of a nucleotide or any other symbolic sequence and the distribution of nucleotides along certain subsequences. It explains the connection between the nucleotide distribution and the size of the spectral coefficients, and gives a necessary and sufficient condition for a coefficient to have a prescribed magnitude. Furthermore, it gives a fast algorithm for computing the value of a given spectral coefficient of a nucleotide sequence, discussing periods 3 and 4 as examples.
View Article and Find Full Text PDF