Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO supports both queries that test k-mer presence/absence (membership queries) and queries that count the number of genomes containing k-mers in a window (conservation queries). MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 s, 2.5× faster than other approaches. MEMO's small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
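To make the two query types concrete, the following is a minimal Python sketch, not MEMO's implementation or API. All function names and the toy sequences are hypothetical, and the MEM intervals are derived by hand for this example. The brute-force function defines what a conservation query returns; the interval function illustrates why MEMs are independent of k: a reference k-mer starting at position i occurs in another genome exactly when some exact match of length at least k covers the interval [i, i + k).

```python
from typing import List, Tuple


def conservation_bruteforce(reference: str, genomes: List[str],
                            start: int, end: int, k: int) -> List[int]:
    """For each k-mer starting in reference[start:end], count how many
    genomes contain it. Plain substring search, for illustration only."""
    last_start = min(end, len(reference) - k + 1)
    counts = []
    for i in range(start, last_start):
        kmer = reference[i:i + k]
        counts.append(sum(kmer in genome for genome in genomes))
    return counts


def membership_from_mems(mems: List[Tuple[int, int]],
                         start: int, end: int, k: int,
                         ref_len: int) -> List[bool]:
    """Membership query against ONE genome, given its exact-match intervals
    on the reference as half-open (begin, end) pairs. The same intervals
    answer the query for any k."""
    last_start = min(end, ref_len - k + 1)
    return [any(b <= i and i + k <= e for b, e in mems)
            for i in range(start, last_start)]


# Toy data (hypothetical).
reference = "ACGTACGTGA"
genomes = ["ACGTACGA", "TTACGTGA"]

# Hand-derived MEMs between the reference and genomes[1]:
#   reference[0:4]  == "ACGT"     (matches genomes[1][2:6])
#   reference[3:10] == "TACGTGA"  (matches genomes[1][1:8])
mems_genome1 = [(0, 4), (3, 10)]

counts = conservation_bruteforce(reference, genomes, 0, 7, k=4)
present = membership_from_mems(mems_genome1, 0, 7, k=4, ref_len=len(reference))
```

In this sketch, changing `k` requires no new index for the interval-based query, whereas a k-mer set would have to be rebuilt for each k.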
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11871630 | PMC |
http://dx.doi.org/10.1186/s13015-025-00272-y | DOI Listing |
Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, lossless, run-length compressed index that enables efficient multi-class metagenomic classification in ( ) space, based on the move structure. Our method identifies all super-maximal exact matches (SMEMs) of length at least between a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array.
Algorithms Mol Biol
March 2025
Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
bioRxiv
February 2025
Department of Computer Science, University of Maryland, MD 20742, USA.
The rapid growth of genomic data over the past decade has made scalable and efficient sequence analysis algorithms, particularly those for constructing de Bruijn graphs and their colored and compacted variants, critical components of many bioinformatics pipelines. Colored compacted de Bruijn graphs condense repetitive sequence information, significantly reducing the data burden on downstream analyses like assembly, indexing, and pan-genomics. However, direct construction of these graphs is necessary, as constructing the original uncompacted graph is essentially infeasible at large scale.
bioRxiv
February 2025
Department of Computer Science, University of Central Florida, Orlando, FL, USA.
Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly. In this linear-reference-based paradigm, the positional Burrows-Wheeler transform (PBWT) has been used to perform efficient haplotype matching. Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome.
Front Immunol
February 2025
Shanghai Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Shanghai, China.
The widespread chronic enteritis known as paratuberculosis (PTB) or Johne's disease (JD) is caused by Mycobacterium avium subspecies paratuberculosis (MAP), posing a significant threat to global public health. Given the challenges associated with PTB or JD, the development and application of vaccines are potentially important for disease control. The aim of this study was to design a multi-epitope vaccine against MAP.