Gene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes, and even along a single gene.
View Article and Find Full Text PDFLarge-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic-, and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data are sufficient for making gene predictions with "high confidence.
View Article and Find Full Text PDFGene prediction has remained an active area of bioinformatics research for a long time. Still, gene prediction in large eukaryotic genomes presents a challenge that must be addressed by new algorithms. The amount and significance of the evidence available from transcriptomes and proteomes vary across genomes, between genes and even along a single gene.
View Article and Find Full Text PDFMassive sequencing of microbiomes has led to the discovery of a large number of phage genomes with intermittent stop codon recoding. We have developed a computational tool, MgCod, that identifies genomic regions (blocks) with distinct stop codon recoding simultaneously with the prediction of protein-coding regions. When MgCod was used to scan a large volume of human metagenomic contigs hundreds of viral contigs with intermittent stop codon recoding were revealed.
View Article and Find Full Text PDFLarge-scale genomic initiatives, such as the Earth BioGenome Project, require efficient methods for eukaryotic genome annotation. Here we present an automatic gene finder, GeneMark-ETP, integrating genomic-, transcriptomic- and protein-derived evidence that has been developed with a focus on large plant and animal genomes. GeneMark-ETP first identifies genomic loci where extrinsic data is sufficient for making gene predictions with 'high confidence'.
View Article and Find Full Text PDFBlackberries (Rubus spp.) are the fourth most economically important berry crop worldwide. Genome assemblies and annotations have been developed for Rubus species in subgenus Idaeobatus, including black raspberry (R.
View Article and Find Full Text PDFFront Bioinform
December 2021
State-of-the-art algorithms of gene prediction for prokaryotic genomes were shown to be sufficiently accurate. A pair of algorithms would agree on predictions of gene 3'ends. Nonetheless, predictions of gene starts would not match for 15-25% of genes in a genome.
View Article and Find Full Text PDFComputational reconstruction of nearly complete genomes from metagenomic reads may identify thousands of new uncultured candidate bacterial species. We have shown that reconstructed prokaryotic genomes along with genomes of sequenced microbial isolates can be used to support more accurate gene prediction in novel sequences. We have proposed an approach that used three types of gene prediction algorithms and found for all contigs in a metagenome nearly optimal models of protein-coding regions either in libraries of pre-computed models or constructed de novo.
View Article and Find Full Text PDFThe task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation.
View Article and Find Full Text PDFWe recently developed a test based on the Agilent SureSelect target enrichment system capturing genomic fragments from 191 human papillomaviruses (HPV) types for Illumina sequencing. This enriched whole genome sequencing (eWGS) assay provides an approach to identify all HPV types in a sample. Here we present a machine learning algorithm that calls HPV types based on the eWGS output.
View Article and Find Full Text PDFWe have made several steps toward creating a fast and accurate algorithm for gene prediction in eukaryotic genomes. First, we introduced an automated method for efficient gene finding, GeneMark-ES, with parameters trained in iterative mode. Next, in GeneMark-ET we proposed a method of integration of unsupervised training with information on intron positions revealed by mapping short RNA reads.
View Article and Find Full Text PDFCyclospora cayetanensis is an intestinal parasite responsible for the diarrheal illness, cyclosporiasis. Molecular genotyping, using targeted amplicon sequencing, provides a complementary tool for outbreak investigations, especially when epidemiological data are insufficient for linking cases and identifying clusters. The goal of this study was to identify candidate genotyping markers using a novel workflow for detection of segregating single nucleotide polymorphisms (SNPs) in C.
View Article and Find Full Text PDFBRAKER is a pipeline for highly accurate and fully automated gene prediction in novel eukaryotic genomes. It combines two major tools: GeneMark-ES/ET and AUGUSTUS. GeneMark-ES/ET learns its parameters from a novel genomic sequence in a fully automated fashion; if available, it uses extrinsic evidence for model refinement.
View Article and Find Full Text PDFIn a conventional view of the prokaryotic genome organization, promoters precede operons and ribosome binding sites (RBSs) with Shine-Dalgarno consensus precede genes. However, recent experimental research suggesting a more diverse view motivated us to develop an algorithm with improved gene-finding accuracy. We describe GeneMarkS-2, an ab initio algorithm that uses a model derived by self-training for finding species-specific (native) genes, along with an array of precomputed "heuristic" models designed to identify harder-to-detect genes (likely horizontally transferred).
View Article and Find Full Text PDFBackground: The genus Potentilla is closely related to that of Fragaria, the economically important strawberry genus. Potentilla micrantha is a species that does not develop berries but shares numerous morphological and ecological characteristics with Fragaria vesca. These similarities make P.
View Article and Find Full Text PDFRecent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence.
View Article and Find Full Text PDFMotivation: Gene finding in eukaryotic genomes is notoriously difficult to automate. The task is to design a work flow with a minimal set of tools that would reach state-of-the-art performance across a wide range of species. GeneMark-ET is a gene prediction tool that incorporates RNA-Seq data into unsupervised training and subsequently generates ab initio gene predictions.
View Article and Find Full Text PDFMassive parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) generates critically important data for eukaryotic gene discovery. Gene finding in transcripts can be done by statistical (alignment-free) as well as by alignment-based methods. We describe a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions in RNA transcripts.
View Article and Find Full Text PDFWe present a new approach to automatic training of a eukaryotic ab initio gene finding algorithm. With the advent of Next-Generation Sequencing, automatic training has become paramount, allowing genome annotation pipelines to keep pace with the speed of genome sequencing. Earlier we developed GeneMark-ES, currently the only gene finding algorithm for eukaryotic genomes that performs automatic training in unsupervised ab initio mode.
View Article and Find Full Text PDFCultivated citrus are selections from, or hybrids of, wild progenitor species whose identities and contributions to citrus domestication remain controversial. Here we sequence and compare citrus genomes--a high-quality reference haploid clementine genome and mandarin, pummelo, sweet-orange and sour-orange genomes--and show that cultivated types derive from two progenitor species. Although cultivated pummelos represent selections from one progenitor species, Citrus maxima, cultivated mandarins are introgressions of C.
View Article and Find Full Text PDFBackground: Little is known about the mechanisms of adaptation of life to the extreme environmental conditions encountered in polar regions. Here we present the genome sequence of a unicellular green alga from the division chlorophyta, Coccomyxa subellipsoidea C-169, which we will hereafter refer to as C-169. This is the first eukaryotic microorganism from a polar environment to have its genome sequenced.
View Article and Find Full Text PDFCucumber (Cucumis sativus L.), a widely cultivated crop, has originated from Eastern Himalayas and secondary domestication regions includes highly divergent climate conditions e.g.
View Article and Find Full Text PDFThe mushroom Coprinopsis cinerea is a classic experimental model for multicellular development in fungi because it grows on defined media, completes its life cycle in 2 weeks, produces some 10(8) synchronized meiocytes, and can be manipulated at all stages in development by mutation and transformation. The 37-megabase genome of C. cinerea was sequenced and assembled into 13 chromosomes.
View Article and Find Full Text PDFWe describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective method is to estimate parameters from dependencies, formed in evolution, between frequencies of oligonucleotides in protein-coding regions and genome nucleotide composition.
View Article and Find Full Text PDFWe have developed a new method for frameshift detection, a combination of ab initio and alignment-based algorithms, that can serve as a useful tool for sequencing quality control in the next generation sequencing. We evaluated the method's accuracy on test sets of annotated genomic sequences with artificial frameshifts in protein coding regions. These tests have shown that the new method performs comparably to the earlier developed FrameD.
View Article and Find Full Text PDF