Computational pangenomics is an emerging research field that is changing the way computer scientists are facing challenges in biological sequence analysis. In past decades, contributions from combinatorics, stringology, graph theory and data structures were essential in the development of a plethora of software tools for the analysis of the human genome. These tools allowed computational biologists to approach ambitious projects at population scale, such as the 1000 Genomes Project.
In recent years, the medicinal plant (L.) Britton has gained scientific interest because leaf extracts, owing to the presence of rosmarinic acid and other polyphenols, have shown anti-allergic and skin-protective potential in pre-clinical studies. Nevertheless, the lack of standardized extracts has limited clinical applications to date.
Motivation: Recent advances in high-throughput RNA-Seq technologies make it possible to produce massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset leads to improved efficiency without compromising the results of the study.
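The idea of discarding reads that are unrelated to the genes of interest can be illustrated with a simple k-mer test. The following is a minimal sketch, not the method of the paper: the function names, the exact k-mer set, and the parameters `k` and `min_fraction` are assumptions made for illustration, and reverse-complement matches are ignored.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_reads(reads, gene_seqs, k=5, min_fraction=0.5):
    """Keep reads sharing at least min_fraction of their k-mers with the genes.

    Hypothetical helper: k and min_fraction are illustrative defaults.
    """
    gene_kmers = set()
    for gene in gene_seqs:
        gene_kmers |= kmers(gene, k)
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: nothing to compare
        shared = sum(1 for x in read_kmers if x in gene_kmers)
        if shared / len(read_kmers) >= min_fraction:
            kept.append(read)
    return kept


genes = ["ACGTACGGTCAATCG"]
reads = ["ACGTACGG", "TTTTTTTTTT", "GGTCAATC"]
relevant = filter_reads(reads, genes)
```

Here `relevant` keeps the first and third reads, whose k-mers all occur in the gene sequence, and drops the poly-T read. A memory-bounded variant could replace the exact set with a probabilistic membership structure.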
Background: While the reconstruction of transcripts from a sample of RNA-Seq data is a computationally expensive and complicated task, the detection of splicing events from RNA-Seq data and a gene annotation is computationally feasible. This latter task, which is adequate for many transcriptome analyses, is usually achieved by aligning the reads to a reference genome, followed by comparing the alignments with a gene annotation, often implicitly represented by a graph: the splicing graph.
Results: We present ASGAL (Alternative Splicing Graph ALigner): a tool for mapping RNA-Seq data to the splicing graph, with the specific goal of detecting novel splicing events, involving either annotated or unannotated splice sites.
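To make the notion of a splicing graph concrete, the sketch below builds one from an annotation and flags exon junctions that a read uses but the annotation does not contain. It is an illustration under simplifying assumptions, not ASGAL's algorithm: exons are modeled as `(start, end)` tuples, and each read is assumed to be already mapped to a list of exons (the hypothetical `read_exon_path`), whereas a real tool has to compute that mapping by alignment.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """Nodes are exons (start, end); an edge links two exons that are
    consecutive in at least one annotated transcript."""
    edges = defaultdict(set)
    for exons in transcripts:
        for a, b in zip(exons, exons[1:]):
            edges[a].add(b)
    return edges


def novel_junctions(graph, read_exon_path):
    """Return consecutive exon pairs used by a read that are not graph edges."""
    return [(a, b) for a, b in zip(read_exon_path, read_exon_path[1:])
            if b not in graph.get(a, set())]


e1, e2, e3 = (100, 200), (300, 400), (500, 600)
graph = build_splicing_graph([[e1, e2, e3]])
skipping = novel_junctions(graph, [e1, e3])   # junction e1 -> e3 is unannotated
annotated = novel_junctions(graph, [e1, e2])  # junction e1 -> e2 is annotated
```

In this toy annotation the read joining `e1` directly to `e3` supports a candidate exon-skipping event, since the only annotated transcript passes through `e2`.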
The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this article, we explore a novel approach to compute the string graph, based on the FM-index and the Burrows-Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads.
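As background for the index mentioned above, the following sketch shows how a Burrows-Wheeler Transform and the two FM-index tables support pattern counting by backward search, without consulting the original text during the query. It is a textbook illustration on a single string, not the string graph construction of the article; the function names are assumptions, and the quadratic rotation sort is only suitable for small examples.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations (demo-sized inputs only)."""
    text += "$"  # sentinel, assumed absent from text and smaller than every symbol
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)


def fm_index(b):
    """Return (C, occ): C[c] counts symbols in b smaller than c,
    occ[c][i] counts occurrences of c in b[:i]."""
    alphabet = sorted(set(b))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += b.count(c)
    occ = {c: [0] * (len(b) + 1) for c in alphabet}
    for i, ch in enumerate(b):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_occurrences(pattern, C, occ, n):
    """Backward search: number of occurrences of pattern in the indexed text."""
    lo, hi = 0, n  # half-open interval of rows in the sorted rotation matrix
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


b = bwt("banana")
C, occ = fm_index(b)
hits = count_occurrences("ana", C, occ, len(b))
```

For a collection of reads, the same interval-narrowing step is what allows suffix-prefix overlaps to be located from the index alone; handling many reads requires separators between them, which this single-string sketch omits.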
Cancer cells often rely on glycolysis to obtain energy and support anabolic growth. Several studies showed that glycolytic cells are susceptible to cell death when subjected to low glucose availability or to lack of glucose. However, some cancer cells, including glycolytic ones, can efficiently acquire higher tolerance to glucose depletion, leading to their survival and aggressiveness.
The large amount of short read data that has to be assembled in future applications, such as in metagenomics or cancer genomics, strongly motivates the investigation of disk-based approaches to index next-generation sequencing (NGS) data. Positive results in this direction stimulate the investigation of efficient external memory algorithms for de novo assembly from NGS data. Our article is also motivated by the open problem of designing a space-efficient algorithm to compute a string graph using an indexing procedure based on the Burrows-Wheeler transform (BWT).
Alternative Splicing (AS) is the molecular phenomenon whereby multiple transcripts are produced from the same gene locus. As a consequence, it is responsible for the expansion of eukaryotic transcriptomes. Aberrant AS is involved in the onset and progression of several human diseases.
Next-generation sequencing (NGS) technologies need new methodologies for alternative splicing (AS) analysis. Current computational methods for AS analysis from NGS data are mainly based on aligning short reads against a reference genome, while methods that do not need a reference genome are mostly underdeveloped. In this context, the main developed tools for NGS data focus on de novo transcriptome assembly (Grabherr et al.
Background: A challenging issue in designing computational methods for predicting the exon-intron structure of a gene from a cluster of transcript (EST, mRNA) sequences is guaranteeing accuracy as well as efficiency in time and space when large clusters of more than 20,000 ESTs and genes longer than 1 Mb are processed. Traditionally, the problem has been tackled by combining different tools that were not specifically designed for this task.
Results: We propose a fast method based on ad hoc procedures for solving the problem.
Alternative splicing is emerging as a major mechanism for the expansion of the transcriptome and proteome diversity, particularly in humans and other vertebrates. However, the proportion of alternative transcripts and proteins actually endowed with functional activity is currently highly debated. We present here a new release of ASPicDB, which now provides a unique annotation resource of human protein variants generated by alternative splicing.
Alternative splicing (AS) is currently considered one of the main mechanisms able to explain the huge gap between the number of predicted genes and the high complexity of the proteome in humans. The rapid growth of Expressed Sequence Tag (EST) data has encouraged the development of computational methods to predict alternative splicing from the analysis of EST alignments to genome sequences. EST data are also a valuable source for reconstructing the different transcript isoforms that derive from the same gene structure as a consequence of AS, since EST sequences are obtained by fragmenting mRNAs from the same gene.
In this paper, we investigate the computational and approximation complexity of the Exemplar Longest Common Subsequence of a set of sequences (ELCS problem), a generalization of the Longest Common Subsequence problem, where the input sequences are over the union of two disjoint sets of symbols, a set of mandatory symbols and a set of optional symbols. We show that different versions of the problem are APX-hard even for instances with two sequences. Moreover, we show that the related problem of determining the existence of a feasible solution of the Exemplar Longest Common Subsequence of two sequences is NP-hard.
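The problem statement can be made concrete with an exhaustive search on two short sequences. The sketch below assumes one particular variant, in which every mandatory symbol must occur at least once and optional symbols are unrestricted; the paper considers several versions, and the function names here are hypothetical. The search is exponential in the sequence length, which is consistent with the hardness results stated above, so it is only meant for toy inputs.

```python
from itertools import combinations


def is_subsequence(candidate, seq):
    """True if candidate can be obtained from seq by deleting symbols."""
    it = iter(seq)
    return all(ch in it for ch in candidate)


def exemplar_lcs(s1, s2, mandatory):
    """Longest common subsequence of s1 and s2 containing every mandatory
    symbol at least once; None when no feasible solution exists."""
    required = set(mandatory)
    for length in range(len(s1), -1, -1):  # try longer candidates first
        for idx in combinations(range(len(s1)), length):
            cand = "".join(s1[i] for i in idx)
            if required <= set(cand) and is_subsequence(cand, s2):
                return cand
    return None


plain = exemplar_lcs("abcM", "Mabc", set())    # ordinary LCS
forced = exemplar_lcs("abcM", "Mabc", {"M"})   # M must be present
infeasible = exemplar_lcs("Mab", "ab", {"M"})  # M missing from s2
```

With no mandatory symbols the search returns the ordinary longest common subsequence `abc`; requiring `M` shrinks the answer to `M` alone, because `M` follows `abc` in one sequence and precedes it in the other. The third call has no feasible solution and returns `None`.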
Alternative splicing (AS) is now emerging as a major mechanism contributing to the expansion of the transcriptome and proteome complexity of multicellular organisms. The fact that a single gene locus may give rise to multiple mRNAs and protein isoforms, showing both major and subtle structural variations, is an exceptionally versatile tool in the optimization of the coding capacity of the eukaryotic genome. The huge and continuously increasing number of genome and transcript sequences provides an essential information source for the computational detection of the AS patterns of genes.
The fact that a large majority of mammalian genes are subject to alternative splicing indicates that this phenomenon represents a major mechanism for increasing proteome complexity. Here, we provide an overview of current methods for the computational prediction of alternative splicing based on the alignment of genome and transcript sequences. Specific features and limitations of different approaches and software are discussed, particularly those affecting prediction accuracy and assembly of alternative transcripts.
Background: Currently available methods to predict splice sites are mainly based on the independent and progressive alignment of transcript data (mostly ESTs) to the genomic sequence. Apart from often being computationally expensive, this approach is vulnerable to several problems, hence the need to develop novel strategies.
Results: We propose a method, based on a novel multiple genome-EST alignment algorithm, for the detection of splice sites.