Publications by authors named "Paul Medvedev"

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes or their exons, sequence repeats, regions with a particular epigenetic state, and copy number variants. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other.

Exact Sketch-Based Read Mapping.

Leibniz Int Proc Inform

September 2023

Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve.

Article Synopsis
  • Apes have two sex chromosomes: the essential Y chromosome for male reproduction and the X chromosome necessary for both reproduction and cognition, with differences in mating patterns affecting their function.
  • Studying these chromosomes is challenging due to their repetitive structures, but researchers created gapless assemblies for five great apes and one lesser ape to explore their evolutionary complexities.
  • The Y chromosomes are highly variable and undergo significant changes compared to the more stable X chromosomes, and this research can provide insights into human evolution and aid in the conservation of endangered ape species.

A colored de Bruijn graph (also called a set of k-mer sets) is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users.
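
To make the definition above concrete, here is a minimal Python sketch of a "set of k-mer sets" stored as a mapping from each k-mer to its colors. It is an illustration only, not the representation used in the article: the function name `build_colored_kmer_set`, the use of sample names as colors, and the toy sequences are all assumptions.

```python
from collections import defaultdict


def build_colored_kmer_set(samples: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every k-mer to the set of colors (here: sample names) containing it.

    `samples` maps a hypothetical sample name to its sequence. A plain
    dictionary is the simplest possible layout; it makes no attempt to
    address the scalability problem the abstract mentions.
    """
    colors: dict[str, set[str]] = defaultdict(set)
    for color, seq in samples.items():
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(color)
    return dict(colors)


# Toy input, assumed for illustration.
graph = build_colored_kmer_set({"s1": "ACGT", "s2": "CGTA"}, k=3)
```

In this toy input the k-mer `CGT` occurs in both samples, so it carries both colors, while `ACG` and `GTA` carry one each.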


Background: Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve.

Article Synopsis
  • Apes have two main sex chromosomes, X and Y, where Y is crucial for male reproduction and its deletions can lead to infertility, while X is important for both reproduction and brain function.
  • Recent advancements in genomic techniques helped researchers create complete structures of the X and Y chromosomes for multiple great ape species, allowing them to explore their evolutionary complexities.
  • Findings indicate that Y chromosomes are highly variable and undergo rapid changes due to unique genetic regions and transposable elements, while X chromosomes are more stable, highlighting differing evolutionary paths among great ape species.

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes or their exons, evolutionarily conserved elements, and regions with a particular epigenetic state. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other.


Y chromosomal ampliconic genes (YAGs) are important for male fertility, as they encode proteins functioning in spermatogenesis. The variation in copy number and expression levels of these multicopy gene families has been studied in great apes; however, the diversity of splicing variants remains unexplored. Here, we deciphered the sequences of polyadenylated transcripts of all nine YAG families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) from testis samples of six great ape species (human, chimpanzee, bonobo, gorilla, Bornean orangutan, and Sumatran orangutan).


The human Y chromosome has been notoriously difficult to sequence and assemble because of its complex repeat structure that includes long palindromes, tandem repeats and segmental duplications. As a result, more than half of the Y chromosome is missing from the GRCh38 reference sequence and it remains the last human chromosome to be finished. Here, the Telomere-to-Telomere (T2T) consortium presents the complete 62,460,029-base-pair sequence of a human Y chromosome from the HG002 genome (T2T-Y) that corrects multiple errors in GRCh38-Y and adds over 30 million base pairs of sequence to the reference, showing the complete ampliconic structures of gene families TSPY, DAZ and RBMY; 41 additional protein-coding genes, mostly from the TSPY family; and an alternating pattern of human satellite 1 and 3 blocks in the heterochromatic Yq12 region.


DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g.


Y-chromosomal Ampliconic Genes (YAGs) are important for male fertility, as they encode proteins functioning in spermatogenesis. The variation in copy number and expression levels of these multicopy gene families has recently been studied in great apes; however, the diversity of splicing variants remains unexplored. Here, we deciphered the sequences of polyadenylated transcripts of all nine YAG families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) from testis samples of six great ape species (human, chimpanzee, bonobo, gorilla, Bornean orangutan, and Sumatran orangutan).


Despite the long history of genome assembly research, there remains a large gap between the theoretical and practical work. There is practical software with little theoretical underpinning of accuracy on one hand and theoretical algorithms which have not been adopted in practice on the other. In this paper we attempt to bridge the gap between theory and practice by showing how the theoretical safe-and-complete framework can be integrated into existing assemblers in order to improve contiguity.


Summary: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data.


Motivation: Third-generation DNA sequencing technologies, such as Nanopore sequencing, can operate at very high speeds and produce longer reads, which in turn poses a challenge for the computational analysis of such massive data. Nanopolish is a software package for signal-level analysis of Oxford Nanopore sequencing data. The call-methylation module of Nanopolish can detect methylation based on a Hidden Markov Model (HMM).


Sequencing errors continue to pose algorithmic challenges to methods working with sequencing data. One of the simplest and most prevalent techniques for ameliorating the detrimental effects of homopolymer expansion/contraction errors present in long reads is homopolymer compression. It collapses runs of repeated nucleotides, to remove some sequencing errors and improve mapping sensitivity.
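
As an illustration of the operation described above, homopolymer compression can be sketched in a few lines of Python. This is a minimal sketch and not the code of any particular mapper; the function name `homopolymer_compress` is an assumption.

```python
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical nucleotides into a single character."""
    # groupby yields one (base, run) pair per maximal run of equal characters.
    return "".join(base for base, _run in groupby(seq))


print(homopolymer_compress("AAACCGTTTTA"))  # ACGTA
```

Because a run of any length collapses to one character, a read with a homopolymer expansion or contraction compresses to the same string as the error-free read, which is why the technique can improve mapping sensitivity.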

Article Synopsis
  • The Javan gibbon (Hylobates moloch) is an endangered species found only in western and central Java, Indonesia, making it one of the rarest gibbons.
  • Gibbons belong to the Hylobatidae family, which has four genera with varying chromosome counts from 38 to 52, though the reasons behind this variation are not well understood due to limited genomic data.
  • The study presents the first detailed genome assembly for H. moloch, utilizing various advanced sequencing techniques, providing a crucial resource for future comparative genomics research in primates.

Summary: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes.


Recent assemblies by the T2T and VGP consortia have achieved significant accuracy but required a tremendous amount of effort and resources. More typical assembly efforts, on the other hand, still suffer both from misassemblies (joining sequences that should not be adjacent) and from underassemblies (not joining sequences that should be adjacent). To better understand the common algorithm-driven causes of these limitations, we investigated the unitig algorithm, which is a core algorithm at the heart of most assemblers.


Motivation: Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this article, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences.
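
For readers unfamiliar with the construction, the following Python sketch shows one common way to build a minimizer sketch and compare two sketches by Jaccard similarity. It is a simplified illustration under stated assumptions, not the estimator analyzed in the article: the helper names, the choice of CRC32 as the k-mer ordering, and the toy parameters `k` and `w` are all hypothetical.

```python
import zlib


def kmers(seq: str, k: int) -> list[str]:
    """All k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer_sketch(seq: str, k: int, w: int) -> set[str]:
    """K-mers that have the smallest hash in some window of w consecutive k-mers."""
    km = kmers(seq, k)
    # CRC32 is an assumed, deterministic stand-in for a random k-mer ordering.
    hashes = [zlib.crc32(x.encode()) for x in km]
    sketch = set()
    for start in range(len(km) - w + 1):
        # Break hash ties by leftmost position.
        best = min(range(start, start + w), key=lambda i: (hashes[i], i))
        sketch.add(km[best])
    return sketch


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Toy sequences, assumed for illustration.
s1 = "ACGTACGTTGCAAGCT"
s2 = "ACGTACGATGCAAGCT"
estimate = jaccard(minimizer_sketch(s1, 4, 3), minimizer_sketch(s2, 4, 3))
exact = jaccard(set(kmers(s1, 4)), set(kmers(s2, 4)))
```

Comparing `estimate` with `exact` on real data is one way to observe the accuracy trade-off that sketching introduces.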


Motivation: Genome annotations are a common way to represent genomic features such as genes, regulatory elements or epigenetic modifications. The amount of overlap between two annotations is often used to ascertain if there is an underlying biological connection between them. In order to distinguish between true biological association and overlap by pure chance, a robust measure of significance is required.

Article Synopsis
  • k-mer-based methods in bioinformatics are commonly used, but their statistical properties are not fully understood, especially regarding mutation processes in sequences.
  • The study derives the expectation and variance for mutated k-mers, islands (intervals of mutated k-mers), and oceans (intervals of nonmutated k-mers), providing key statistical insights.
  • The findings include hypothesis tests and confidence intervals for analyzing mutated k-mers, and they showcase practical applications such as improving estimates in Mash distance, enhancing read alignment with Minimap2, and evaluating long-read alignments with Jabba.

All vertebrate genomes have been colonized by retroviruses along their evolutionary trajectory. Although endogenous retroviruses (ERVs) can contribute important physiological functions to contemporary hosts, such benefits are attributed to long-term coevolution of ERV and host because germline infections are rare and expansion is slow, and because the host effectively silences them. The genomes of several outbred species including mule deer (Odocoileus hemionus) are currently being colonized by ERVs, which provides an opportunity to study ERV dynamics at a time when few are fixed.
