Compressed full-text indexes enable efficient sequence classification against a pangenome or tree-of-life index. Past work on compressed-index classification used matching statistics or pseudo-matching lengths to capture the fine-grained co-linearity of exact matches. But these fail to capture coarse-grained information about whether seeds appear co-linearly in the reference.
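The matching-statistics idea mentioned above can be illustrated naively: for each read position, the length of the longest match starting there that occurs anywhere in the reference. This is a minimal sketch with made-up sequences; real classifiers compute these values with a compressed index, not substring scans.

```python
def matching_statistics(read, ref):
    """Naive matching statistics: for each position i in the read,
    the length of the longest prefix of read[i:] occurring in ref.
    (Real tools answer this via a compressed index, not scanning.)"""
    ms = []
    for i in range(len(read)):
        length = 0
        # Grow the match until it no longer occurs in the reference.
        while i + length < len(read) and read[i:i + length + 1] in ref:
            length += 1
        ms.append(length)
    return ms

# Positions where the read matches the reference well get large values.
print(matching_statistics("GATTACA", "TTACAGG"))  # [1, 1, 5, 4, 3, 2, 1]
```

The coarse-grained co-linearity signal the abstract refers to is exactly what such per-position lengths alone cannot capture: they say how long each match is, not whether matches occur in a consistent order in the reference.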
As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples.
For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use k-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can:
■ build an augmented FM-index over the genomes in the tree, concatenated in left-to-right order;
■ for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes;
■ find the minimum and maximum values stored in that interval;
■ take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.
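The steps above can be sketched on a toy example. The genomes, the tree, and the concatenated layout below are invented for illustration, and a real implementation queries an FM-index instead of scanning the text.

```python
from bisect import bisect_right

# Three toy genomes concatenated in left-to-right (tree) order.
genomes = ["ACGTACGT", "ACGTTTGT", "GGGTACGT"]
text = "".join(genomes)
starts = [0, 8, 16]                      # genome start offsets in the text

# A tiny phylogeny via parent pointers: leaves 0,1 under internal node 3;
# leaf 2 and node 3 under root 4.
parent = {0: 3, 1: 3, 2: 4, 3: 4, 4: None}

def genome_at(pos):
    """Which genome contains text position pos."""
    return bisect_right(starts, pos) - 1

def lca(a, b):
    """Lowest common ancestor by walking parent pointers."""
    anc = set()
    while a is not None:
        anc.add(a); a = parent[a]
    while b not in anc:
        b = parent[b]
    return b

def classify(mem):
    # Occurrence positions of the MEM; a real index retrieves these as a
    # suffix-array interval rather than by scanning.
    occ = [i for i in range(len(text)) if text.startswith(mem, i)]
    lo, hi = min(occ), max(occ)   # min/max values stored in that interval
    return lca(genome_at(lo), genome_at(hi))

print(classify("TACG"))   # occurs in genomes 0 and 2 -> root (node 4)
print(classify("GTTT"))   # occurs only in genome 1 -> leaf 1
```

Because the genomes are concatenated in tree order, the minimum and maximum suffix-array values bracket all occurrences, so the LCA of their two genomes covers every genome containing the MEM.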
New high-quality human genome assemblies derived from lymphoblastoid cell lines (LCLs) provide reference genomes and pangenomes for genomics studies. However, the characteristics of LCLs pose technical challenges to profiling immunoglobulin (IG) genes. IG loci in LCLs contain a mixture of germline and somatically recombined haplotypes, making them difficult to genotype or assemble accurately.
Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling. But past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes.
Taxonomic sequence classification is a computational problem central to the study of metagenomics and evolution. Advances in compressed indexing with the r-index enable full-text pattern matching against large sequence collections. But the data structures that link pattern sequences to their clades of origin still do not scale well to large collections.
Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation.
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries.
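Run-block compression itself is detailed in the Centrifuger paper; as a simplified stand-in, a plain run-length view of a BWT already shows how rank queries can be answered from runs alone. The strings below are toys, and the linear-scan rank here is not the sublinear-space scheme the abstract describes.

```python
from itertools import groupby

def run_length(bwt):
    """Run-length view of a BWT string: (char, length) pairs."""
    return [(c, len(list(g))) for c, g in groupby(bwt)]

def rank(runs, c, i):
    """Occurrences of c among the first i characters, answered from the
    run-length view (linear scan here; run-block compression organizes
    runs into blocks so this is fast in small space)."""
    count, pos = 0, 0
    for ch, ln in runs:
        if pos >= i:
            break
        take = min(ln, i - pos)
        if ch == c:
            count += take
        pos += ln
    return count

bwt = "AAAACCCGGTTTT"            # a highly repetitive (toy) BWT
runs = run_length(bwt)
print(runs)                      # [('A', 4), ('C', 3), ('G', 2), ('T', 4)]
print(rank(runs, 'C', 6))        # 2 C's among the first 6 characters
```

The repetitive databases mentioned above produce long runs, which is why run-based representations compress them well.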
Many bioinformatics methods seek to reduce reference bias, but no methods exist to comprehensively measure it. Biastools analyzes and categorizes instances of reference bias. It works in various scenarios: when the donor's variants are known and reads are simulated; when donor variants are known and reads are real; and when variants are unknown and reads are real.
We investigate a relatively underexplored component of the gut-immune axis by profiling the antibody response to gut phages using Phage Immunoprecipitation Sequencing (PhIP-Seq). To cover large antigenic spaces, we develop Dolphyn, a method that uses machine learning to select peptides from protein sets and compresses the proteome through epitope-stitching. Dolphyn compresses the size of a peptide library by 78% compared to traditional tiling, increasing the antibody-reactive peptides from 10% to 31%.
Genome assembly databases are growing rapidly. The redundancy of sequence content between a new assembly and previous ones is neither conceptually nor algorithmically easy to measure. We introduce pertinent methods and DandD, a tool that addresses how much new sequence is gained when a sequence collection grows.
Pangenome indexes reduce reference bias in sequencing data analysis. However, bias can be reduced further by using a personalized reference, e.g.
Complete, telomere-to-telomere (T2T) genome assemblies promise improved analyses and the discovery of new variants, but many essential genomic resources remain associated with older reference genomes. Thus, there is a need to translate genomic features and read alignments between references. Here we describe a method called levioSAM2 that performs fast and accurate lift-over between assemblies using a whole-genome map.
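The idea of lift-over through a whole-genome map can be sketched with a toy chain of collinear blocks. The coordinates below are invented, and levioSAM2's actual map format and algorithm differ; this only illustrates coordinate translation between two assemblies.

```python
from bisect import bisect_right

# Hypothetical whole-genome map: (src_start, dst_start, length) blocks of
# collinear sequence between two assemblies, sorted by src_start. The gaps
# between blocks model insertions/deletions, where positions don't lift.
chain = [(0, 0, 100), (100, 150, 200), (320, 350, 80)]
src_starts = [b[0] for b in chain]

def lift(pos):
    """Translate one source coordinate to the target assembly, or None
    if it falls in an unmapped gap (a toy version of lift-over)."""
    i = bisect_right(src_starts, pos) - 1
    if i < 0:
        return None
    src, dst, length = chain[i]
    if pos < src + length:
        return dst + (pos - src)
    return None

print(lift(50))    # inside block 1 -> 50
print(lift(120))   # inside block 2 -> 170
print(lift(310))   # gap between blocks -> None
```

Lifting a read alignment additionally requires translating its CIGAR string across block boundaries, which is where most of the real complexity lives.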
Efficient pangenome indexes are promising tools for many applications, including rapid classification of nanopore sequencing reads. Recently, a compressed-index data structure called the "move structure" was proposed as an alternative to other BWT-based indexes like the FM index and r-index. The move structure uniquely achieves both O(r) space and O(1)-time queries, where r is the number of runs in the pangenome BWT.
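The LF-stepping idea behind the move structure can be sketched in run coordinates: one table row per BWT run records where LF sends the run's head, and a step is an offset addition plus a short fast-forward loop. This is a toy illustration of the mechanism, not the actual move-structure memory layout or its constant-time guarantee.

```python
from itertools import groupby

def build_lf(bwt):
    # LF mapping: BWT position i -> position of bwt[i] in the sorted
    # first column (stable within equal characters).
    order = sorted(range(len(bwt)), key=lambda i: (bwt[i], i))
    lf = [0] * len(bwt)
    for k, i in enumerate(order):
        lf[i] = k
    return lf

def build_move(bwt):
    # One row per BWT run: where LF sends the run head, as (run, offset).
    lf = build_lf(bwt)
    starts, lens, pos = [], [], 0
    for _, g in groupby(bwt):
        n = len(list(g)); starts.append(pos); lens.append(n); pos += n
    def to_run(p):  # absolute position -> (run, offset)
        j = next(k for k in reversed(range(len(starts))) if starts[k] <= p)
        return j, p - starts[j]
    table = [to_run(lf[s]) for s in starts]
    return starts, lens, table

def move_step(lens, table, run, off):
    # One LF step in (run, offset) coordinates: within a run, LF values
    # are consecutive, so add the offset, then fast-forward over runs.
    drun, doff = table[run]
    doff += off
    while doff >= lens[drun]:
        doff -= lens[drun]; drun += 1
    return drun, doff

bwt = "AAACCCGAAC"    # toy string standing in for a run-compressed BWT
starts, lens, table = build_move(bwt)
lf = build_lf(bwt)
# Every position, stepped in run coordinates, agrees with plain LF.
for p in range(len(bwt)):
    run = max(k for k in range(len(starts)) if starts[k] <= p)
    drun, doff = move_step(lens, table, run, p - starts[run])
    assert starts[drun] + doff == lf[p]
print("move steps match LF")
```

Only the per-run table is stored, so space is O(r); the real structure additionally bounds the fast-forward loop so each step is worst-case constant time.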
We investigated a relatively underexplored component of the gut-immune axis by profiling the antibody response to gut phages using Phage Immunoprecipitation Sequencing (PhIP-Seq). To enhance this approach, we developed Dolphyn, a novel method that uses machine learning to select peptides from protein sets and compresses the proteome through epitope-stitching. Dolphyn improves the fraction of gut phage library peptides bound by antibodies from 10% to 31% in healthy individuals, while also reducing the number of synthesized peptides by 78%.
A Wheeler graph represents a collection of strings in a way that is particularly easy to index and query. Such a graph is a practical choice for representing a graph-shaped pangenome, and it is the foundation for current graph-based pangenome indexes. However, there are no practical tools to visualize or to check graphs that may have the Wheeler properties.
A genomic sketch is a small, probabilistic representation of the set of k-mers in a sequencing data set. Sketches are building blocks for large-scale analyses that consider similarities between many pairs of sequences or sequence collections. Although existing tools can easily compare tens of thousands of genomes, data sets can reach millions of sequences and beyond.
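The sketching idea can be illustrated with a bottom-s MinHash over k-mers, which supports Jaccard-similarity estimates from tiny summaries. The parameters, hash function, and sequences below are illustrative, not those of any particular sketching tool.

```python
import hashlib

def sketch(seq, k=4, s=8):
    """Bottom-s MinHash sketch of a sequence's k-mer set: keep the s
    smallest hash values over all k-mers."""
    hashes = {int(hashlib.sha1(seq[i:i+k].encode()).hexdigest(), 16)
              for i in range(len(seq) - k + 1)}
    return set(sorted(hashes)[:s])

def jaccard_estimate(a, b, s=8):
    """Estimate Jaccard similarity of the underlying k-mer sets from
    the two sketches alone (bottom-s estimator)."""
    merged = set(sorted(a | b)[:s])   # bottom-s sketch of the union
    return len(merged & a & b) / len(merged)

s1 = sketch("ACGTACGTACGTAA")
s2 = sketch("ACGTACGTACGTTT")
print(round(jaccard_estimate(s1, s2), 2))  # 4 of 7 k-mers shared -> ~0.57
```

The point of sketching at scale is that the estimate needs only the s retained hashes per data set, so all-pairs comparisons over millions of sequences stay tractable.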
Tools that classify sequencing reads against a database of reference sequences require efficient index data structures. The r-index is a compressed full-text index that answers substring presence/absence, count, and locate queries in space proportional to the amount of distinct sequence in the database: O(r) space, where r is the number of Burrows-Wheeler runs. To date, the r-index has lacked the ability to quickly classify matches according to which reference sequences (or sequence groupings, i.
Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array.
We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect to large panels like the 1000 Genomes Project while reducing the reference bias that results from aligning to a single linear reference.
Genome assembly databases are growing rapidly. The sequence content in each new assembly can be largely redundant with previous ones, but this is neither conceptually nor algorithmically easy to measure. We propose new methods and a new tool called DandD that addresses the question of how much new sequence is gained when a sequence collection grows.
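The question DandD addresses can be illustrated with the crudest possible proxy: counting the distinct k-mers a new assembly adds beyond the collection. DandD itself estimates a repetitiveness measure (delta) using sketches rather than exact k-mer sets; the sequences and k below are invented for the example.

```python
def kmers(seq, k):
    """The set of distinct k-mers in a sequence."""
    return {seq[i:i+k] for i in range(len(seq) - k + 1)}

def new_sequence_gained(collection, new_seq, k=4):
    """Distinct k-mers the new assembly adds beyond the collection --
    a simple exact proxy for the sketch-based measures DandD estimates."""
    seen = set()
    for s in collection:
        seen |= kmers(s, k)
    return len(kmers(new_seq, k) - seen)

collection = ["ACGTACGTAGGT"]
print(new_sequence_gained(collection, "ACGTACGTAGGT"))  # identical: 0 new
print(new_sequence_gained(collection, "TTTTACGTACGA"))  # mostly novel: 4 new
```

Exact k-mer sets become infeasible at database scale, which is why sketch-based estimation is the interesting part of the problem.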