Publications by authors named "Alexander Bolshoy"

The existence of multiple copies of genes is a well-known phenomenon. A gene family is a set of sufficiently similar genes, formed by gene duplication. In earlier works conducted on a limited number of completely sequenced and annotated genomes, it was found that the size of a gene family and the size of the genome are positively correlated.

Background: The length of a protein sequence is largely determined by its function. In certain species, it may also be affected by additional factors, such as growth temperature or acidity. In 2002, it was shown that in the bacterium Escherichia coli and in the archaeon Archaeoglobus fulgidus, protein sequences with no homologs were, on average, shorter than those with homologs (BMC Evol Biol 2:20, 2002).

Proteins of the same functional family (for example, kinases) may have significantly different lengths. It is an open question whether such variation in length is random or arises as a response to some unknown evolutionary driving factors. The main purpose of this paper is to demonstrate the existence of factors affecting prokaryotic gene lengths.

Ancestral sequence reconstruction is a well-known problem in molecular evolution. The problem presented in this study is inspired by sequence reconstruction, but instead of leaf-associated sequences we consider only their lengths. We call this problem ancestral gene length reconstruction.

In this paper we present a novel method for ranking genomes according to gene lengths. The main outcomes described in this paper are the following: the formulation of the genome ranking problem, the presentation of relevant approaches to solving it, and the demonstration of preliminary results from ordering prokaryotic genomes. Using a subset of prokaryotic genomes, we attempted to uncover factors affecting gene length.

In this paper, we propose a method to classify prokaryotic genomes using the agglomerative information bottleneck method for unsupervised clustering. Although the method we present here is closely related to a group of methods based on detecting the presence or absence of genes, our method is different because it uses gene lengths as well. We show that this amended method is reliable.

Parvovirus B19 has an extreme tropism for human erythroid progenitors. Here we propose a hypothesis explaining the tropism of human parvovirus B19. Our speculations are based on experimental results related to the capsid proteins VP1 and VP2.

Advances in Escherichia coli genome research have made information on the transcription start sites of many genes available. A study relying on the availability of transcription start locations was performed. The first question addressed was what an average DNA curvature profile upstream of genes would look like when these genes are aligned by transcription start sites, in comparison to alignment by translation start sites.

'Evolution Canyon' (ECI) at Lower Nahal Oren, Mount Carmel, Israel, is an optimal natural microscale model for unravelling evolution in action highlighting the twin evolutionary processes of adaptation and speciation. A major model organism in ECI is wild barley, Hordeum spontaneum, the progenitor of cultivated barley, which displays dramatic interslope adaptive and speciational divergence on the 'African' dry slope (AS) and the 'European' humid slope (ES), separated on average by 200 m. Here we examined interslope single nucleotide polymorphism (SNP) sequences and the expression diversity of the drought resistant dehydrin 1 gene (Dhn1) between the opposite slopes.

Background: Given a large sequence fragment or a set of functionally related sequences, we consider two sequence-analysis problems associated with the given sequence(s). The first problem is to measure sequence complexity (repetitiveness, compactness) to estimate how informative the set as a whole is. Usually, the obtained measure should be compared with an appropriate random background calculated using permutations of the given sequences.
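
The comparison against a permutation background can be sketched as follows. This is only an illustration under stated assumptions: the complexity measure used here (the number of distinct k-mers) is a stand-in chosen for brevity, not the measure used in the paper, and the function names, the shuffle count, and the seed are all hypothetical.

```python
import random

def distinct_kmers(seq, k):
    """Assumed complexity proxy: number of distinct words of length k."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

def shuffled_background(seq, k, n_shuffles=200, seed=0):
    """Evaluate the same proxy on random permutations of the sequence."""
    rng = random.Random(seed)
    letters = list(seq)
    values = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys letter order
        values.append(distinct_kmers("".join(letters), k))
    return values

def fraction_at_most_observed(seq, k, n_shuffles=200, seed=0):
    """Share of shuffled sequences whose proxy value does not exceed the observed one.

    A small value suggests the sequence is more repetitive than its
    composition alone would explain.
    """
    observed = distinct_kmers(seq, k)
    background = shuffled_background(seq, k, n_shuffles, seed)
    return sum(v <= observed for v in background) / n_shuffles
```

Shuffling preserves the letter composition, so any difference between the observed value and the background reflects the ordering of letters rather than their frequencies.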

It is known that DNA curvature plays a certain role in gene regulation. The distribution of curved DNA in promoter regions is evolutionarily preserved, and it is mainly determined by the temperature of the habitat. However, very little is known about the distribution of DNA curvature in termination sites.

With the availability of genome sequences, the possibility of new phylogenetic reconstructions arises in order to reveal genomic relationships among organisms. According to the compositional-spectra (CS) approach proposed in our previous studies, any genomic sequence can be characterized by a distribution of frequencies of imperfect matching of words (oligonucleotides). In the current application of CS-analysis, we attempted to analyze the cluster structure of genomes across life.

DNA curvature is known to play a biological role in gene regulation, in particular, in the initiation of transcription. We applied the software CURVATURE, based on the wedge model, to predict whether promoter regions of certain prokaryotes may be characterized by higher intrinsic DNA curvature located within or upstream of these regions. The main purpose was to verify our earlier hypothesis that the DNA curvature plays a biological role in gene regulation in mesophilic as compared to hyperthermophilic prokaryotes, i.

The phenomenon of overlapping of various sequence messages in genomes is a puzzle for evolutionary theoreticians, geneticists, and sequence researchers. The overlapping is possible due to degeneracy of the messages, in particular, degeneracy of codons. It is often observed in organisms with a limited size of genome, possessing polymerases of low fidelity.

The centromere sequence parC of Escherichia coli low-copy-number plasmid R1 consists of two sets of 11 bp iterated sequences. Here we analysed the intrinsic sequence-directed curvature of parC by its migration anomaly in polyacrylamide gels. The 159 bp long parC is strongly curved with anomaly values (k-factors) close to 2.

Background: Sequence periodicity with a period close to the DNA helical repeat is a very basic genomic property. This genomic feature was demonstrated for many prokaryotic genomes. The Escherichia coli sequences display the period close to 11 base pairs.
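
One simple way to look for such a period is to count how often a short word recurs at each distance. The sketch below is an assumption-laden illustration, not the method of the paper: the choice of the dinucleotide "AA", the maximum distance, and the function name are all hypothetical.

```python
def distance_histogram(seq, word="AA", max_distance=30):
    """Count pairs of occurrences of `word` separated by each distance d.

    hist[d] is the number of positions p such that `word` starts at
    both p and p + d. Peaks at multiples of some d hint at a period.
    """
    positions = [i for i in range(len(seq) - len(word) + 1)
                 if seq.startswith(word, i)]
    position_set = set(positions)
    hist = [0] * (max_distance + 1)
    for p in positions:
        for d in range(1, max_distance + 1):
            if p + d in position_set:
                hist[d] += 1
    return hist

# Toy sequence built with an exact period of 11 letters.
toy = ("AA" + "C" * 9) * 5
hist = distance_histogram(toy)
```

On real genomic sequence the peaks are far weaker than in this toy example, so the histogram would need to be compared against a suitable background before any period is claimed.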

This is a review of the methods based on counting oligomers in nucleotide and amino acid sequences. Such methods are analogous to the formal linguistic analysis of human texts. This review includes methods based on the calculation of observed occurrences (frequencies) of oligomers and their distribution, as well as those based on deviations between the observed and the expected occurrences (contrast words, genome signatures) in biological sequences.
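
As a minimal sketch of the observed-versus-expected idea, the code below compares each oligomer count with the count expected if letters were independent. The independence (zero-order) model is an assumption made here for simplicity; the review covers other expectation models, and the function names are hypothetical.

```python
from collections import Counter
from math import prod

def kmer_counts(seq, k):
    """Observed occurrences of every word of length k (overlapping windows)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def observed_expected_ratio(seq, k):
    """Ratio of observed to expected count for each word present in seq.

    Expected count assumes independent letters:
    (number of windows) * product of the letter frequencies in the word.
    """
    n_windows = len(seq) - k + 1
    base_freq = {b: c / len(seq) for b, c in Counter(seq).items()}
    ratios = {}
    for word, observed in kmer_counts(seq, k).items():
        expected = n_windows * prod(base_freq[b] for b in word)
        ratios[word] = observed / expected
    return ratios
```

Ratios well above or below 1 mark words that are over- or under-represented relative to this simple background.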

Retroviruses and LTR retrotransposons comprise two long-terminal repeats (LTRs) bounding a central domain that encodes the products needed for reverse transcription, packaging, and integration into the genome. We describe a group of retrotransposons in 13 species and four genera of the grass tribe Triticeae, including barley, with long, approximately 4.4-kb LTRs formerly called Sukkula elements.

We introduce a novel, linguistic-like method of genome analysis. We propose a natural approach to characterizing genomic sequences based on occurrences of fixed length words from a predefined, sufficiently large set of words (strings over the alphabet [A, C, G, T]). A measure based on this approach is called compositional spectrum and is actually a histogram of imperfect word occurrences.
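
A histogram of imperfect word occurrences can be sketched as below. This is an illustration only: it assumes that "imperfect" means a Hamming distance of at most `max_mismatches`, that all words in the set share one length, and that the histogram is normalized to sum to 1. These choices and the function names are assumptions, not details taken from the paper.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def compositional_spectrum(sequence, words, max_mismatches=1):
    """Normalized histogram of imperfect occurrences of each word.

    For every window of the sequence, each word that matches it with
    at most `max_mismatches` substitutions receives one count.
    """
    if not words:
        return []
    k = len(words[0])
    if any(len(w) != k for w in words):
        raise ValueError("all words must have the same length")
    counts = [0] * len(words)
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        for j, word in enumerate(words):
            if hamming(window, word) <= max_mismatches:
                counts[j] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]
```

Two sequences can then be compared by any distance between their spectra computed on the same word set; the scan above costs time proportional to sequence length times the number of words, which is acceptable for a sketch but slow for whole genomes.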

The coexistence of multiple codes in the genome of human immunodeficiency virus type 1 (HIV-1) was analyzed. We explored factors constraining the variability of the virus genome primarily in relation to conserved RNA secondary structures overlapping coding sequences, and used a simple combination of algorithms for RNA secondary structure prediction based on the nearest-neighbor thermodynamic rules and a statistical approach. In our previous study, we applied this combination to a non-redundant data set of env nucleotide sequences, confirmed the conservative secondary structure of the rev-responsive element (RRE) and found a new RNA structure in the first conserved (C1) region of the env gene.

We have analyzed amino acid, nucleotide sequence, and RNA secondary structure variability in the env gene of human immunodeficiency virus type 1 (HIV-1). In applying algorithms for computing optimal RNA-folding patterns to a nonredundant data set of 178 env nucleotide sequences, we found a conserved RNA stem-loop structure in the first conserved (C1) region of the env gene. This detailed examination also revealed the known secondary structure conservation of the Rev-responsive element (RRE).

Motivation: One of the major features of genomic DNA sequences, distinguishing them from texts in most spoken or artificial languages, is their high repetitiveness. Variation in the repetitiveness of genomic texts reflects the presence and density of different biologically important messages. Thus, deviation from an expected number of repeats in both directions indicates a possible presence of a biological signal.
