MONI (Rossi et al., 2022) is a BWT-based compressed index for computing the matching statistics and maximal exact matches (MEMs) of a pattern (usually a DNA read) with respect to a highly repetitive text (usually a database of genomes) using two operations: LF-steps and longest common extension (LCE) queries on a grammar-compressed representation of the text. In practice, most of the operations are constant-time LF-steps but most of the time is spent evaluating LCE queries.
View Article and Find Full Text PDFFor taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use -mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can ■ build an augmented FM-index over the the genomes in the tree concatenated in left-to-right order; ■ for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; ■ find the minimum and maximum values stored in that interval; ■ take the lowest common ancestor (LCA) of the genomes containing the characters at those positions.
View Article and Find Full Text PDFOne approach to reduce the side effects of chemotherapy in cancer treatment is photodynamic therapy (PDT), which allows spatiotemporal control of the cytotoxicity. We have used the strategy of coordinating π-expansive ligands to increase the excited state lifetimes of Ir(III) half-sandwich complexes in order to facilitate the generation of O. We have obtained derivatives of formulas [Cp*Ir(CN)Cl] and [Cp*Ir(CN)L]BF with different degrees of π-expansion in the CN ligands.
View Article and Find Full Text PDFBackground: Recently, extensive cancer genomic studies have revealed mutational and clinical data of large cohorts of cancer patients. For example, the Pan-Lung Cancer 2016 dataset (part of The Cancer Genome Atlas project), summarises the mutational and clinical profiles of different subtypes of Lung Cancer (LC). Mutational and clinical signatures have been used independently for tumour typification and prediction of metastasis in LC patients.
View Article and Find Full Text PDFProc Data Compress Conf
March 2021
Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al.
View Article and Find Full Text PDFPrefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string , it produces a dictionary and a parse of overlapping phrases such that BWT() can be computed from and in time and workspace bounded in terms of their combined size |PFP()|.
View Article and Find Full Text PDFJ Bioinform Comput Biol
June 2019
Signaling pathways are responsible for the regulation of cell processes, such as monitoring the external environment, transmitting information across membranes, and making cell fate decisions. Given the increasing amount of biological data available and the recent discoveries showing that many diseases are related to the disruption of cellular signal transduction cascades, in silico discovery of signaling pathways in cell biology has become an active research topic in past years. However, reconstruction of signaling pathways remains a challenge mainly because of the need for systematic approaches for predicting causal relationships, like edge direction and activation/inhibition among interacting proteins in the signal flow.
View Article and Find Full Text PDFMotivation: Genome repositories are growing faster than our storage capacities, challenging our ability to store, transmit, process and analyze them. While genomes are not very compressible individually, those repositories usually contain myriads of genomes or genome reads of the same species, thereby creating opportunities for orders-of-magnitude compression by exploiting inter-genome similarities. A useful compression system, however, cannot be only usable for archival, but it must allow direct access to the sequences, ideally in transparent form so that applications do not need to be rewritten.
View Article and Find Full Text PDFSuffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees.
View Article and Find Full Text PDFMany proteins work together with others in groups called complexes in order to achieve a specific function. Discovering protein complexes is important for understanding biological processes and predict protein functions in living organisms. Large-scale and throughput techniques have made possible to compile protein-protein interaction networks (PPI networks), which have been used in several computational approaches for detecting protein complexes.
View Article and Find Full Text PDFMost of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their repetitiveness, which can reduce their space usage by orders of magnitude. We study the problem of indexing repetitive string collections in order to perform efficient document retrieval operations on them.
View Article and Find Full Text PDFWe consider the problem of encoding range minimum queries (RMQs): given an array A[1..n] of distinct totally ordered values, to pre-process A and create a data structure that can answer the query RMQ(i,j), which returns the index containing the smallest element in A[i.
View Article and Find Full Text PDFBackground: The Andes-Amazon basin of Peru and Bolivia is one of the most data-poor, biologically rich, and rapidly changing areas of the world. Conservation scientists agree that this area hosts extremely high endemism, perhaps the highest in the world, yet we know little about the geographic distributions of these species and ecosystems within country boundaries. To address this need, we have developed conservation data on endemic biodiversity (~800 species of birds, mammals, amphibians, and plants) and terrestrial ecological systems (~90; groups of vegetation communities resulting from the action of ecological processes, substrates, and/or environmental gradients) with which we conduct a fine scale conservation prioritization across the Amazon watershed of Peru and Bolivia.
View Article and Find Full Text PDFA repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees.
View Article and Find Full Text PDFIEEE Trans Pattern Anal Mach Intell
September 2008
We introduce a new probabilistic proximity search algorithm for range and K-nearest neighbor (K-NN) searching in both coordinate and metric spaces. Although there exist solutions for these problems, they boil down to a linear scan when the space is intrinsically high-dimensional, as is the case in many pattern recognition tasks. This, for example, renders the K-NN approach to classification rather slow in large databases.
View Article and Find Full Text PDFThe problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching (for instance, one PROSITE protein site is associated with the CBG [RK] - x(2,3) - [DE] - x(2,3) - Y, where the brackets match any of the letters inside, and x(2,3) a gap of length between 2 and 3). Currently, the only way to search for a CBG in a text is to convert it into a full regular expression (RE). However, a RE is more sophisticated than a CBG, and searching for it with a RE pattern matching algorithm complicates the search and makes it slow.
View Article and Find Full Text PDF