Pangenomes are becoming a powerful framework to perform many bioinformatics analyses taking into account the genetic variability of a population, thus reducing the bias introduced by a single reference genome. With the wider diffusion of pangenomes, integrating genetic variability with transcriptome diversity is becoming a natural extension that demands specific methods for its exploration. In this work, we extend the notion of spliced pangenomes to that of annotated spliced pangenomes; this allows us to introduce a formal definition of Alternative Splicing (AS) events on a graph structure.
View Article and Find Full Text PDFBackground: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with some heuristics, without assuming some explicit optimization criteria.
View Article and Find Full Text PDFMotivation: Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species.
View Article and Find Full Text PDFThe literature has demonstrated the potential for detecting accurate electrical signals that correspond to the will or intention to move, as well as decoding the thoughts of individuals who imagine houses, faces or objects. This investigation examines the presence of precise neural markers of imagined motivational states through the combining of electrophysiological and neuroimaging methods. 20 participants were instructed to vividly imagine the desire to move, listen to music or engage in social activities.
View Article and Find Full Text PDFComputational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project.
View Article and Find Full Text PDFThe massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned, and curated full-length sequences, such as those found in the GISAID database.
View Article and Find Full Text PDFBackground: Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, making it the most widely sequenced virus in the history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences, and it is continuously expanding every day. A challenging task is the development of fast and accurate tools that are able to distinguish between the different SARS-CoV-2 variants and assign them to a clade.
View Article and Find Full Text PDFBackground: Being able to efficiently call variants from the increasing amount of sequencing data daily produced from multiple viral strains is of the utmost importance, as demonstrated during the COVID-19 pandemic, in order to track the spread of the viral strains across the globe.
Results: We present MALVIRUS, an easy-to-install and easy-to-use application that assists users in multiple tasks required for the analysis of a viral population, such as the SARS-CoV-2. MALVIRUS allows to: (1) construct a variant catalog consisting in a set of variations (SNPs/indels) from the population sequences, (2) efficiently genotype and annotate variants of the catalog supported by a read sample, and (3) when the considered viral species is the SARS-CoV-2, assign the input sample to the most likely Pango lineages using the genotyped variations.
In the recent years, there has been an increasing amount of single-cell sequencing studies, producing a considerable number of new data sets. This has particularly affected the field of cancer analysis, where more and more articles are published using this sequencing technique that allows for capturing more detailed information regarding the specific genetic mutations on each individually sampled cell. As the amount of information increases, it is necessary to have more sophisticated and rapid tools for analyzing the samples.
View Article and Find Full Text PDFIEEE J Biomed Health Inform
November 2021
Single cell sequencing (SCS) technologies provide a level of resolution that makes it indispensable for inferring from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes it difficult, sometimes infeasible using current approaches and tools. One possible solution is to reduce the size of an SCS instance - usually represented as a matrix of presence, absence, and uncertainty of the mutations found in the different sequenced cells - and to infer the tree from this reduced-size instance.
View Article and Find Full Text PDFBackground: Cancer progression reconstruction is an important development stemming from the phylogenetics field. In this context, the reconstruction of the phylogeny representing the evolutionary history presents some peculiar aspects that depend on the technology used to obtain the data to analyze: Single Cell DNA Sequencing data have great specificity, but are affected by moderate false negative and missing value rates. Moreover, there has been some recent evidence of back mutations in cancer: this phenomenon is currently widely ignored.
View Article and Find Full Text PDFMotivation: Recent advances in high-throughput RNA-Seq technologies allow to produce massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset leads to improved efficiency without compromising the results of the study.
View Article and Find Full Text PDFMotivation: In recent years, the well-known Infinite Sites Assumption has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progressions. However, recent studies leveraging single-cell sequencing (SCS) techniques have shown evidence of the widespread recurrence and, especially, loss of mutations in several tumor samples. While there exist established computational methods that infer phylogenies with mutation losses, there remain some advancements to be made.
View Article and Find Full Text PDFMotivation: The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarities have been proposed in the literature, but none of them has emerged as the golden standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases.
View Article and Find Full Text PDFIEEE/ACM Trans Comput Biol Bioinform
March 2020
Most of the evolutionary history reconstruction approaches are based on the infinite sites assumption, which states that mutations appear once in the evolutionary history. The Perfect Phylogeny model is the result of the infinite sites assumption and has been widely used to infer cancer evolution. Nonetheless, recent results show that recurrent and back mutations are present in the evolutionary history of tumors, hence the Perfect Phylogeny model might be too restrictive.
View Article and Find Full Text PDFBackground: While the reconstruction of transcripts from a sample of RNA-Seq data is a computationally expensive and complicated task, the detection of splicing events from RNA-Seq data and a gene annotation is computationally feasible. This latter task, which is adequate for many transcriptome analyses, is usually achieved by aligning the reads to a reference genome, followed by comparing the alignments with a gene annotation, often implicitly represented by a graph: the splicing graph.
Results: We present ASGAL (Alternative Splicing Graph ALigner): a tool for mapping RNA-Seq data to the splicing graph, with the specific goal of detecting novel splicing events, involving either annotated or unannotated splice sites.
The availability of genomic datasets in association with clinical, phenotypic, and drug sensitivity information represents an invaluable source for potential therapeutic applications, supporting the identification of new drug sensitivity biomarkers and pharmacological targets. Drug discovery and precision oncology can largely benefit from the integration of treatment molecular discriminants obtained from cell line models and clinical tumor samples; however this task demands comprehensive analysis approaches for the discovery of underlying data connections. Here we introduce PATRI (Platform for the Analysis of TRanslational Integrated data), a standalone tool accessible through a user-friendly graphical interface, conceived for the identification of treatment sensitivity biomarkers from user-provided genomics data, associated with information on sample characteristics.
View Article and Find Full Text PDFBackground: Haplotype assembly is the process of assigning the different alleles of the variants covered by mapped sequencing reads to the two haplotypes of the genome of a human individual. Long reads, which are nowadays cheaper to produce and more widely available than ever before, have been used to reduce the fragmentation of the assembled haplotypes since their ability to span several variants along the genome. These long reads are also characterized by a high error rate, an issue which may be mitigated, however, with larger sets of reads, when this error rate is uniform across genome positions.
View Article and Find Full Text PDFThe string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this article, we explore a novel approach to compute the string graph, based on the FM-index and Burrows and Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads.
View Article and Find Full Text PDFThe large amount of short read data that has to be assembled in future applications, such as in metagenomics or cancer genomics, strongly motivates the investigation of disk-based approaches to index next-generation sequencing (NGS) data. Positive results in this direction stimulate the investigation of efficient external memory algorithms for de novo assembly from NGS data. Our article is also motivated by the open problem of designing a space-efficient algorithm to compute a string graph using an indexing procedure based on the Burrows-Wheeler transform (BWT).
View Article and Find Full Text PDFLife sciences are yielding huge data sets that underpin scientific discoveries fundamental to improvement in human health, agriculture and the environment. In support of these discoveries, a plethora of databases and tools are deployed, in technically complex and diverse implementations, across a spectrum of scientific disciplines. The corpus of documentation of these resources is fragmented across the Web, with much redundancy, and has lacked a common standard of information.
View Article and Find Full Text PDFAlternative Splicing (AS) is the molecular phenomenon whereby multiple transcripts are produced from the same gene locus. As a consequence, it is responsible for the expansion of eukaryotic transcriptomes. Aberrant AS is involved in the onset and progression of several human diseases.
View Article and Find Full Text PDFBackground: The perfect phylogeny is an often used model in phylogenetics since it provides an efficient basic procedure for representing the evolution of genomic binary characters in several frameworks, such as for example in haplotype inference. The model, which is conceptually the simplest, is based on the infinite sites assumption, that is no character can mutate more than once in the whole tree. A main open problem regarding the model is finding generalizations that retain the computational tractability of the original model but are more flexible in modeling biological data when the infinite site assumption is violated because of e.
View Article and Find Full Text PDFPurpose: Osteoporosis is a chronic disease of the bone, whose incidence increases progressively with aging. The main consequences of osteoporosis are fragility fractures, which have considerable medical, social, and economic implications. Adequate treatment of osteoporosis must be considered as a compelling public health intervention.
View Article and Find Full Text PDF