Errors in multiple sequence alignments (MSAs) are known to bias many comparative evolutionary methods. In the context of natural selection analyses, specifically codon evolutionary models, excessive rates of false positives result. A characteristic signature of error-driven findings is unrealistically high estimates of dN/dS (e.
View Article and Find Full Text PDFOur ability to generate sequencing data and assemble it into high quality complete genomes has rapidly advanced in recent years. These data promise to advance our understanding of organismal biology and answer longstanding evolutionary questions. Multiple genome alignment is a key tool in this quest.
View Article and Find Full Text PDFImprovements in genome sequencing and assembly are enabling high-quality reference genomes for all species. However, the assembly process is still laborious, computationally and technically demanding, lacks standards for reproducibility, and is not readily scalable. Here we present the latest Vertebrate Genomes Project assembly pipeline and demonstrate that it delivers high-quality reference genomes at scale across a set of vertebrate species arising over the last ~500 million years.
View Article and Find Full Text PDFThere are thousands of well-maintained high-quality open-source software utilities for all aspects of scientific data analysis. For more than a decade, the Galaxy Project has been providing computational infrastructure and a unified user interface for these tools to make them accessible to a wide range of researchers. To streamline the process of integrating tools and constructing workflows as much as possible, we have developed Planemo, a software development kit for tool and workflow developers and Galaxy power users.
View Article and Find Full Text PDFAn important unmet need revealed by the COVID-19 pandemic is the near-real-time identification of potentially fitness-altering mutations within rapidly growing SARS-CoV-2 lineages. Although powerful molecular sequence analysis methods are available to detect and characterize patterns of natural selection within modestly sized gene-sequence datasets, the computational complexity of these methods and their sensitivity to sequencing errors render them effectively inapplicable in large-scale genomic surveillance contexts. Motivated by the need to analyze new lineage evolution in near-real time using large numbers of genomes, we developed the Rapid Assessment of Selection within CLades (RASCL) pipeline.
View Article and Find Full Text PDFAmong the 30 nonsynonymous nucleotide substitutions in the Omicron S-gene are 13 that have only rarely been seen in other SARS-CoV-2 sequences. These mutations cluster within three functionally important regions of the S-gene at sites that will likely impact (1) interactions between subunits of the Spike trimer and the predisposition of subunits to shift from down to up configurations, (2) interactions of Spike with ACE2 receptors, and (3) the priming of Spike for membrane fusion. We show here that, based on both the rarity of these 13 mutations in intrapatient sequencing reads and patterns of selection at the codon sites where the mutations occur in SARS-CoV-2 and related sarbecoviruses, prior to the emergence of Omicron the mutations would have been predicted to decrease the fitness of any virus within which they occurred.
View Article and Find Full Text PDFThe NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts.
View Article and Find Full Text PDFUnlabelled: An important component of efforts to manage the ongoing COVID19 pandemic is the R apid A ssessment of how natural selection contributes to the emergence and proliferation of potentially dangerous S ARS-CoV-2 lineages and CL ades (RASCL). The RASCL pipeline enables continuous comparative phylogenetics-based selection analyses of rapidly growing clade-focused genome surveillance datasets, such as those produced following the initial detection of potentially dangerous variants. From such datasets RASCL automatically generates down-sampled codon alignments of individual genes/ORFs containing contextualizing background reference sequences, analyzes these with a battery of selection tests, and outputs results as both machine readable JSON files, and interactive notebook-based visualizations.
View Article and Find Full Text PDFAmong the 30 non-synonymous nucleotide substitutions in the Omicron S-gene are 13 that have only rarely been seen in other SARS-CoV-2 sequences. These mutations cluster within three functionally important regions of the S-gene at sites that will likely impact (i) interactions between subunits of the Spike trimer and the predisposition of subunits to shift from down to up configurations, (ii) interactions of Spike with ACE2 receptors, and (iii) the priming of Spike for membrane fusion. We show here that, based on both the rarity of these 13 mutations in intrapatient sequencing reads and patterns of selection at the codon sites where the mutations occur in SARS-CoV-2 and related sarbecoviruses, prior to the emergence of Omicron the mutations would have been predicted to decrease the fitness of any genomes within which they occurred.
View Article and Find Full Text PDFThe programmed frameshift element (PFE) rerouting translation from ORF1a to ORF1b is essential for the propagation of coronaviruses. The combination of genomic features that make up PFE-the overlap between the two reading frames, a slippery sequence, as well as an ensemble of complex secondary structure elements-places severe constraints on this region as most possible nucleotide substitution may disrupt one or more of these elements. The vast amount of SARS-CoV-2 sequencing data generated within the past year provides an opportunity to assess the evolutionary dynamics of PFE in great detail.
View Article and Find Full Text PDFThe programmed frameshift element (PFE) rerouting translation from to is essential for propagation of coronaviruses. A combination of genomic features that make up PFE-the overlap between the two reading frames, a slippery sequence, as well as an ensemble of complex secondary structure elements-puts severe constraints on this region as most possible nucleotide substitution may disrupt one or more of these elements. The vast amount of SARS-CoV-2 sequencing data generated within the past year provides an opportunity to assess evolutionary dynamics of PFE in great detail.
View Article and Find Full Text PDFBackground: Significant progress has been made in advancing and standardizing tools for human genomic and biomedical research. Yet, the field of next-generation sequencing (NGS) analysis for microorganisms (including multiple pathogens) remains fragmented, lacks accessible and reusable tools, is hindered by local computational resource limitations, and does not offer widely accepted standards. One such "problem areas" is the analysis of Transposon Insertion Sequencing (TIS) data.
View Article and Find Full Text PDFThe COVID-19 pandemic is shifting teaching to an online setting all over the world. The Galaxy framework facilitates the online learning process and makes it accessible by providing a library of high-quality community-curated training materials, enabling easy access to data and tools, and facilitates sharing achievements and progress between students and instructors. By combining Galaxy with robust communication channels, effective instruction can be designed inclusively, regardless of the students' environments.
View Article and Find Full Text PDFSequencing technology has achieved great advances in the past decade. Studies have previously shown the quality of specific instruments in controlled conditions. Here, we developed a method able to retroactively determine the error rate of most public sequencing datasets.
View Article and Find Full Text PDFThe COVID-19 pandemic is the first global health crisis to occur in the age of big genomic data.Although data generation capacity is well established and sufficiently standardized, analytical capacity is not. To establish analytical capacity it is necessary to pull together global computational resources and deliver the best open source tools and analysis workflows within a ready to use, universally accessible resource.
View Article and Find Full Text PDFDuplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition with the purpose to understand data loss and implement modifications to maximize the data output for the variant calling.
View Article and Find Full Text PDFBackground: The vast ecosystem of single-cell RNA-sequencing tools has until recently been plagued by an excess of diverging analysis strategies, inconsistent file formats, and compatibility issues between different software suites. The uptake of 10x Genomics datasets has begun to calm this diversity, and the bioinformatics community leans once more towards the large computing requirements and the statistically driven methods needed to process and understand these ever-growing datasets.
Results: Here we outline several Galaxy workflows and learning resources for single-cell RNA-sequencing, with the aim of providing a comprehensive analysis environment paired with a thorough user learning experience that bridges the knowledge gap between the computational methods and the underlying cell biology.
The current state of much of the Wuhan pneumonia virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) research shows a regrettable lack of data sharing and considerable analytical obfuscation. This impedes global research cooperation, which is essential for tackling public health emergencies and requires unimpeded access to data, analysis tools, and computational infrastructure. Here, we show that community efforts in developing open analytical software tools over the past 10 years, combined with national investments into scientific computational infrastructure, can overcome these deficiencies and provide an accessible platform for tackling global health emergencies in an open and transparent manner.
View Article and Find Full Text PDF