Publications by authors named "Korotkov E"

In this study, we applied the iterative procedure (IP) method to search for families of highly diverged dispersed repeats in the genome of , which contains over 16 million bases. The algorithm included the construction of position weight matrices (PWMs) for repeat families and the identification of more dispersed repeats based on the PWMs using dynamic programming. The results showed that the genome contained 20 repeat families comprising a total of 33,938 dispersed repeats, which is significantly more than has been previously found using other methods.
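
As a rough illustration of the PWM step described above, the following minimal sketch builds a log-odds PWM from already aligned repeat copies and scans a sequence with it. It is not the authors' IP method: the scan is ungapped (the abstract uses dynamic programming, which also handles insertions and deletions), and the function names, the pseudocount, and the uniform background of 0.25 are assumptions made for this example.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(aligned_seqs, pseudocount=1.0, background=0.25):
    """Return a log-odds PWM as a list of dicts, one dict per alignment column."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all sequences must have the same length")
    total = len(aligned_seqs) + pseudocount * len(BASES)
    pwm = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs)
        # Pseudocounts keep unseen bases from producing log(0).
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score(pwm, window):
    """Sum the column scores for a window with the same length as the PWM."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Ungapped scan: return (start, score) of the best-scoring window."""
    width = len(pwm)
    hits = [(i, score(pwm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]
    return max(hits, key=lambda h: h[1])
```

For example, `best_hit(build_pwm(["ACGT", "ACGT", "ACGA"]), "TTTACGTTTT")` reports the window starting at index 3. The sketch assumes sequences contain only A, C, G, and T.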

The exact identification of promoter sequences remains a serious problem in computational biology, as the promoter prediction algorithms under development continue to produce false-positive results. Therefore, to fully assess the validity of predicted sequences, it is necessary to perform a comprehensive test of their properties, such as the presence of downstream transcribed DNA regions behind them, or chromatin accessibility for transcription factor binding. In this paper, we examined the promoter sequences of chromosome 1 of the rice genome from the Database of Potential Promoter Sequences predicted using a mathematical algorithm based on the derivation and calculation of statistically significant promoter classes.

We have developed a new method for promoter sequence classification based on a genetic algorithm and the MAHDS sequence alignment method. We have created four classes of human promoters, combining 17,310 sequences out of the 29,598 present in the EPD database. We searched the human genome for potential promoter sequences (PPSs) using dynamic programming and position weight matrices representing each of the promoter sequence classes.

We have developed a de novo method for the identification of dispersed repeats based on the use of random position-weight matrices (PWMs) and an iterative procedure (IP). The resulting algorithm (the IP method) detects dispersed repeats for which the average number of substitutions per nucleotide between any two repeats is less than or equal to 1.5.

In this study, we modified the multiple alignment method based on the generation of random position weight matrices (RPWM) and used it to search for tandem repeats (TRs) in the Capsicum annuum genome. The modified (m)RPWM method, which considers the correlation of adjacent nucleotides, identified 908,072 TR regions with repeat lengths from 2 to 200 bp, which occupied ~29% of the C. annuum genome.

In this study, we used a mathematical method for the multiple alignment of highly divergent sequences (MAHDS) to create a database of potential promoter sequences (PPSs) in the genome. To search for PPSs, 20 statistically significant classes of sequences located in the range from -499 to +100 nucleotides near the annotated genes were calculated. For each class, a position-weight matrix (PWM) was computed and then used to identify PPSs in the genome.

In this paper, we attempted to find a relation between the living conditions of bacteria and the algorithmic complexity of their genomes. We developed a probabilistic mathematical method for evaluating the irregularity of occurrence of k-words (6 bases long) in bacterial gene coding sequences. To this end, we analyzed the coding sequences from different bacterial genomes and used W, which has a distribution similar to normal, as an index of k-word occurrence irregularity.
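
The k-word counting that such an analysis starts from can be sketched as follows. This is only an illustration: the abstract does not define the index W, so the chi-square statistic against a uniform expectation used here is a stand-in, and both function names are hypothetical.

```python
from collections import Counter

def kword_counts(sequence, k=6):
    """Count overlapping k-words, skipping words with non-ACGT symbols."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts

def chi_square_uniform(counts, k=6):
    """Chi-square statistic of observed counts against a uniform expectation.

    Stand-in irregularity measure; NOT the W index from the abstract.
    """
    n_words = 4 ** k
    total = sum(counts.values())
    if total == 0:
        return 0.0
    expected = total / n_words
    observed_part = sum((c - expected) ** 2 / expected for c in counts.values())
    # Each k-word that never occurs contributes (0 - e)^2 / e = e.
    unseen_part = (n_words - len(counts)) * expected
    return observed_part + unseen_part
```

A larger statistic means the k-words occur less evenly than a uniform model would predict.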

The aim of this work was to compare the multiple alignment methods MAHDS, T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK in their ability to align highly divergent amino acid sequences. To accomplish this, we created test amino acid sequences with an average number of substitutions per amino acid (x) from 0.6 to 5.

We report a Method to Search for Highly Divergent Tandem Repeats (MSHDTR) in protein sequences which considers pairwise correlations between adjacent residues. MSHDTR was compared with some previously developed methods for searching for tandem repeats (TRs) in amino acid sequences, such as T-REKS and XSTREAM, which focus on the identification of TRs with significant sequence similarity, whereas MSHDTR detects repeats that significantly diverged during evolution, accumulating deletions, insertions, and substitutions. The application of MSHDTR to a search of the Swiss-Prot databank revealed over 15 thousand TR-containing amino acid sequences that were difficult to find using the other methods.

Currently, there is a lack of bioinformatics approaches to identify highly divergent tandem repeats (TRs) in eukaryotic genomes. Here, we developed a new mathematical method to search for TRs, which uses a novel algorithm for constructing multiple alignments based on the generation of random position weight matrices (RPWMs), and applied it to detect TRs 2 to 50 nucleotides long in the rice genome. The RPWM method could find highly divergent TRs in the presence of insertions or deletions.

In Russia and around the world, there are important questions regarding the potential threats to national and biological safety created by genetic technologies and the need to improve or introduce new, justified, and adequate measures for their control, regulation, and prevention. The article shows that five major transgenic crops occupy a significant share of the global market, and that producers are ready to switch to crops with edited genomes, which have been approved in the United States, Argentina, and other countries. We propose a qualitatively new approach to the risk assessment of edited plants, "Safe Design," and have also developed a fundamentally new approach to methods that combine next-generation sequencing (NGS) and bioinformatics to assess the biosafety of crop imports.

Article Synopsis
  • Transposable elements (TEs), specifically Short Interspersed Nuclear Elements (SINEs), play a major role in eukaryotic genomes and are challenging to identify due to rapid mutations after insertion.
  • The Highly Divergent Repeat Search Method (HDRSM) outperformed the RepeatMasker program in identifying and accurately determining the boundaries of highly divergent SINE copies in the rice genome, revealing 14,030 hits – with 5,704 missed by RepeatMasker.
  • To achieve a complete understanding of SINE distribution, using both HDRSM and RepeatMasker is advised, as HDRSM excels in detecting divergent copies while RepeatMasker is more effective for shorter, more similar copies.

In this study, we developed a new mathematical method for performing multiple alignment of highly divergent sequences (MAHDS), i.e., sequences that have on average more than 2.

A new mathematical method for potential reading frameshift detection in protein-coding sequences (cds) was developed. The algorithm is adjusted to the triplet periodicity of each analysed sequence using dynamic programming and a genetic algorithm. This does not require any preliminary training.

A new mathematical method was used for the first time to search for tandem repeats with insertions and deletions in the full-length sequence of the A. thaliana genome. The method is based on a new algorithm for multiple alignment of sequences of certain periods without using paired comparisons of sequences.

We analyzed several prokaryotic and eukaryotic genomes for the presence of periodic sequences using a new mathematical method. The method uses random position weight matrices and dynamic programming.

The aim of this study was to show that amino acid sequences have a latent periodicity with insertions and deletions of amino acids in unknown positions of the analyzed sequence. A genetic algorithm, dynamic programming, and random weight matrices were used to develop a new mathematical algorithm for latent periodicity search. A multiple alignment of periods was calculated by direct optimization of the position-weight matrix, without using pairwise alignments.

A mathematical method was developed to search for latent periodicity in protein amino acid sequences and other symbolic sequences using dynamic programming and random matrices. The method detects latent periodicity with insertions and deletions at previously unknown positions. It was applied to search for periodicity in the amino acid sequences of some proteins and in the EUR/USD exchange rate since 2001.

In recent years, a great number of bacterial genomes have been sequenced. One of the most important challenges of computational genomics is now the functional annotation of nucleic acid sequences. In this study, we present a computational method and an annotation system for predicting biological functions using phylogenetic profiles.
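
The general idea behind phylogenetic profiles can be sketched in a few lines: each gene is represented by a presence/absence vector across genomes, and genes with similar vectors are candidates for a shared function. This is a generic illustration, not the annotation system from the abstract; the gene names, the profiles, and the Hamming-distance threshold are all hypothetical.

```python
def hamming(p, q):
    """Number of genomes in which two presence/absence profiles differ."""
    return sum(a != b for a, b in zip(p, q))

def nearest_profiles(query, profiles, max_distance=1):
    """Return names of genes whose profile is within max_distance of the query's."""
    target = profiles[query]
    return sorted(
        name for name, prof in profiles.items()
        if name != query and hamming(target, prof) <= max_distance
    )

# Hypothetical profiles: 1 = a homolog is present in that genome, 0 = absent.
example_profiles = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 0, 0, 1),
    "geneD": (1, 0, 1, 0, 0),
}
```

With these example data, `nearest_profiles("geneA", example_profiles)` returns `["geneB", "geneD"]`, so an unannotated geneA could tentatively inherit the annotation of those neighbors.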

It is known that nucleotide sequences are not totally homogeneous and that this heterogeneity cannot be explained by random fluctuations alone. Such heterogeneity poses the problem of segmenting a sequence into a set of homogeneous parts separated by points called "change points". In this work, we investigated a special case of change points: paired change points (PCP).

Triplet periodicity (TP) is a distinctive feature of the protein coding sequences of both prokaryotic and eukaryotic genomes. In this work, we explored the TP difference inside and between 45 prokaryotic genomes. We constructed two hypotheses of TP distribution on a set of coding sequences and generated artificial datasets that correspond to the hypotheses.
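
One common way to represent triplet periodicity is a 3 x 4 matrix of base counts by codon position, with mutual information between position and base as a measure of how pronounced the periodicity is. The sketch below illustrates that representation only; it assumes this matrix form and this measure, which the abstract does not spell out, and the function names are chosen for the example.

```python
import math

BASES = "ACGT"

def tp_matrix(cds):
    """Count each base at codon positions 0, 1, 2 (rows) for A, C, G, T (columns)."""
    m = [[0] * 4 for _ in range(3)]
    for pos, base in enumerate(cds.upper()):
        if base in BASES:
            m[pos % 3][BASES.index(base)] += 1
    return m

def mutual_information(m):
    """Mutual information (bits) between codon position and base identity."""
    total = sum(sum(row) for row in m)
    row_sums = [sum(row) for row in m]
    col_sums = [sum(m[i][j] for i in range(3)) for j in range(4)]
    info = 0.0
    for i in range(3):
        for j in range(4):
            if m[i][j]:
                p = m[i][j] / total
                info += p * math.log2(m[i][j] * total / (row_sums[i] * col_sums[j]))
    return info
```

A sequence such as `"ATGATGATG"` gives the maximum value for three positions, log2(3) bits, while a homopolymer such as `"AAAAAA"` gives 0.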

To determine the periodicity of a DNA sequence, different spectral approaches are applied: discrete Fourier transform (DFT), autocorrelation (CORR), information decomposition (ID), a hybrid method (HYB), the spectral envelope concept for spectral analysis (SE), normalized autocorrelation (CORR_N), and profile analysis (PA). In this work, we investigated whether these methods can find the true period length, depending on the average number of accumulated changes in DNA bases (PM). The results show that for periods with short length (≤4 b.
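
To make the autocorrelation idea concrete, here is a minimal sketch that estimates a period as the lag with the highest fraction of matching bases. It is a simplified stand-in for the CORR approach named above, not the implementation compared in the paper; the function names and the tie-breaking rule (smallest lag wins) are assumptions for this example.

```python
def match_fraction(seq, lag):
    """Fraction of positions i where seq[i] equals seq[i + lag]."""
    pairs = len(seq) - lag
    return sum(seq[i] == seq[i + lag] for i in range(pairs)) / pairs

def estimate_period(seq, max_lag):
    """Return the lag in 1..max_lag with the highest match fraction.

    Requires len(seq) >= 2. Python's max() keeps the first maximum,
    so ties are resolved in favor of the smallest lag.
    """
    lags = range(1, min(max_lag, len(seq) - 1) + 1)
    return max(lags, key=lambda lag: match_fraction(seq, lag))
```

For a perfect repeat such as `"ACGTT" * 10`, lags 5, 10, 15, and 20 all match completely, and the smallest of them is returned as the period. As substitutions accumulate, the peak at the true period flattens, which is the effect the abstract investigates.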

We describe a new mathematical method for finding very diverged short tandem repeats containing a single indel. The method involves comparison of two frequency matrices: a first matrix for a subsequence before shift and a second one for a subsequence after it. A measure of comparison is based on matrix similarity.

A web server for searching for latent periodicity, based on the method of modified profile analysis, has been developed. This method allows searching for latent periodicity in the presence of insertions and deletions. The search uses periodicity classes that we found earlier for various groups of organisms.
