MEM-based pangenome indexing for k-mer queries.

Stephen Hwang, Nathaniel K Brown, Omar Y Ahmed, Katharine M Jenike, Sam Kovaka, Michael C Schatz, Ben Langmead

Algorithms Mol Biol

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.

Published: March 2025

Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs can only answer questions at a single, fixed substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO supports two kinds of queries: membership queries, which test k-mer presence or absence, and conservation queries, which count the number of genomes containing k-mers in a window. MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 s, 2.5× faster than other approaches. MEMO's small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
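The claim that one MEM-based index can serve every k can be illustrated with a toy sketch. This is not MEMO's implementation: the function names, the brute-force MEM computation, and the use of a single "pivot" sequence whose windows are queried are all assumptions made for illustration. The sketch relies only on one fact: a k-mer starting at pivot position i occurs in another genome exactly when some exact match between the two sequences covers the interval [i, i+k).

```python
from bisect import bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open interval [start, end) on the pivot


def mem_intervals(pivot: str, other: str) -> List[Interval]:
    """Brute-force pivot intervals of maximal exact matches with `other`.

    Toy-scale only; a real index would use a compressed full-text index.
    """
    mems: List[Interval] = []
    prev_end = 0
    for s in range(len(pivot)):
        e = s
        # Extend to the right while pivot[s:e+1] still occurs in `other`.
        while e < len(pivot) and pivot[s:e + 1] in other:
            e += 1
        # Keep the interval only if it is not contained in an earlier one.
        if e > s and e > prev_end:
            mems.append((s, e))
            prev_end = e
    return mems


def kmer_present(mems: List[Interval], i: int, k: int) -> bool:
    """Membership query: is the pivot k-mer starting at i covered by a MEM?"""
    # Starts and ends both increase, so the last MEM starting at or
    # before i is the one that reaches furthest to the right.
    idx = bisect_right(mems, (i, float("inf"))) - 1
    return idx >= 0 and mems[idx][1] >= i + k


def conservation(pivot: str, others: List[str],
                 start: int, end: int, k: int) -> List[int]:
    """Conservation query: per k-mer in pivot[start:end], count genomes containing it."""
    all_mems = [mem_intervals(pivot, g) for g in others]
    last = min(end, len(pivot) - k + 1)
    return [sum(kmer_present(m, i, k) for m in all_mems)
            for i in range(start, last)]


def naive_conservation(pivot: str, others: List[str],
                       start: int, end: int, k: int) -> List[int]:
    """Reference answer by direct substring search, used only as a check."""
    last = min(end, len(pivot) - k + 1)
    return [sum(pivot[i:i + k] in g for g in others)
            for i in range(start, last)]


pivot = "ACGTACGTTG"
others = ["ACGTACGA", "TTGACG"]
for k in (2, 3, 5):
    print(k, conservation(pivot, others, 0, len(pivot), k))
```

In this sketch the intervals are computed once per genome and then reused for every value of k, whereas a k-mer table would have to be rebuilt for each k. How MEMO stores and compresses its intervals is not described in the abstract and is not modeled here.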

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11871630
DOI: http://dx.doi.org/10.1186/s13015-025-00272-y

Similar Publications

Run-length compressed metagenomic read classification with SMEM-finding and tagging.

bioRxiv

February 2025

Lore Depuydt, Omar Y Ahmed, Jan Fostier, Ben Langmead, Travis Gagie

Metagenomic read classification is a fundamental task in computational biology, yet it remains challenging due to the scale, diversity, and complexity of sequencing datasets. We propose a novel, lossless, run-length compressed index that enables efficient multi-class metagenomic classification in O(r) space, based on the move structure. Our method identifies all super-maximal exact matches (SMEMs) of length at least L between a read and the reference dataset and associates each SMEM with one class identifier using a sampled tag array.

Fast and Scalable Parallel External-Memory Construction of Colored Compacted de Bruijn Graphs with Cuttlefish 3.

bioRxiv

February 2025

Department of Computer Science, University of Maryland, MD 20742, USA.

Jamshed Khan, Laxman Dhulipala, Rob Patro

The rapid growth of genomic data over the past decade has made scalable and efficient sequence analysis algorithms, particularly those for constructing de Bruijn graphs and their colored and compacted variants, critical components of many bioinformatics pipelines. Colored compacted de Bruijn graphs condense repetitive sequence information, significantly reducing the data burden on downstream analyses like assembly, indexing, and pan-genomics. However, direct construction of these graphs is necessary, as constructing the original uncompacted graph is essentially infeasible at large scale.

Haplotype Matching with GBWT for Pangenome Graphs.

bioRxiv

February 2025

Department of Computer Science, University of Central Florida, Orlando, FL, USA.

Ahsan Sanaullah, Seba Villalobos, Degui Zhi, Shaojie Zhang

Traditionally, variations from a linear reference genome were used to represent large sets of haplotypes compactly. In the linear reference genome based paradigm, the positional Burrows-Wheeler transform (PBWT) has traditionally been used to perform efficient haplotype matching. Pangenome graphs have recently been proposed as an alternative to linear reference genomes for representing the full spectrum of variations in the human genome.

Design of a multi-epitope vaccine against Mycobacterium avium subspecies paratuberculosis.

Front Immunol

February 2025

Shanghai Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Shanghai, China.

Weiqi Guo, Xinyu Wang, Jiangang Hu, Beibei Zhang, Luru Zhao

The widespread chronic enteritis known as paratuberculosis (PTB) or Johne's disease (JD) is caused by Mycobacterium avium subspecies paratuberculosis (MAP), posing a significant threat to global public health. Given the challenges associated with PTB or JD, the development and application of vaccines are potentially important for disease control. The aim of this study was to design a multi-epitope vaccine against MAP.
