Coding information is the main source of heterogeneity (non-randomness) in the sequences of microbial genomes. The heterogeneity corresponds to a cluster structure in triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in microbial genomic sequences and explained its properties.
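As a minimal sketch of what a triplet distribution of a short fragment might look like in practice, the following Python computes a 64-dimensional triplet frequency vector for each window of a sequence. The function names, the choice of non-overlapping triplets, the 300 bp window (picked from the 200-400 bp range above), and the 150 bp step are illustrative assumptions, not details confirmed by the abstract.

```python
from collections import Counter
from itertools import product

# All 64 triplets over the DNA alphabet, in a fixed order (assumed convention).
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def triplet_frequencies(fragment, phase=0):
    """Return a 64-element frequency vector of non-overlapping triplets.

    Triplets are read starting at offset `phase` (0, 1 or 2). Triplets that
    contain symbols other than A, C, G, T are ignored.
    """
    fragment = fragment.upper()
    counts = Counter(
        fragment[i:i + 3] for i in range(phase, len(fragment) - 2, 3)
    )
    total = sum(counts[t] for t in TRIPLETS)
    if total == 0:
        return [0.0] * len(TRIPLETS)
    return [counts[t] / total for t in TRIPLETS]


def fragment_vectors(sequence, width=300, step=150):
    """Slide a window along `sequence`; return one frequency vector per window.

    `width` and `step` are hypothetical defaults chosen for illustration.
    """
    return [
        triplet_frequencies(sequence[start:start + width])
        for start in range(0, len(sequence) - width + 1, step)
    ]
```

Each window then becomes one point in a 64-dimensional space, and the cluster structure described above would be looked for among these points, for example with a projection or clustering method of the reader's choice.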
Several recent papers have proposed new gene-detection algorithms that find protein-coding regions without requiring a learning dataset of already known genes. The fact that unsupervised gene detection is possible is closely connected to the existence of a cluster structure in oligomer frequency distributions. In this paper we study the cluster structure of several genomes in the space of their triplet frequencies, using a pure data exploration strategy.