Publications by authors named "Anton Nekrutenko"

Errors in multiple sequence alignments (MSAs) are known to bias many comparative evolutionary methods. In the context of natural selection analyses, specifically codon evolutionary models, excessive rates of false positives result. A characteristic signature of error-driven findings is unrealistically high estimates of dN/dS (e.

Our ability to generate sequencing data and assemble it into high quality complete genomes has rapidly advanced in recent years. These data promise to advance our understanding of organismal biology and answer longstanding evolutionary questions. Multiple genome alignment is a key tool in this quest.

Improvements in genome sequencing and assembly are enabling high-quality reference genomes for all species. However, the assembly process is still laborious, computationally and technically demanding, lacks standards for reproducibility, and is not readily scalable. Here we present the latest Vertebrate Genomes Project assembly pipeline and demonstrate that it delivers high-quality reference genomes at scale across a set of vertebrate species arising over the last ~500 million years.

Article Synopsis
  • The text discusses the importance of protein-protein interactions in cellular processes and how identifying these interactions can lead to new drug targets for diseases.
  • An automated pipeline was developed to predict protein-protein interactions across genomes, demonstrating success in modeling interactions in both human and yeast proteins, particularly in relation to SARS-CoV-2.
  • The method produces reliable interaction models that can be experimentally validated, and the pipeline is publicly accessible at specific Galaxy platforms.

There are thousands of well-maintained high-quality open-source software utilities for all aspects of scientific data analysis. For more than a decade, the Galaxy Project has been providing computational infrastructure and a unified user interface for these tools to make them accessible to a wide range of researchers. To streamline the process of integrating tools and constructing workflows as much as possible, we have developed Planemo, a software development kit for tool and workflow developers and Galaxy power users.

An important unmet need revealed by the COVID-19 pandemic is the near-real-time identification of potentially fitness-altering mutations within rapidly growing SARS-CoV-2 lineages. Although powerful molecular sequence analysis methods are available to detect and characterize patterns of natural selection within modestly sized gene-sequence datasets, the computational complexity of these methods and their sensitivity to sequencing errors render them effectively inapplicable in large-scale genomic surveillance contexts. Motivated by the need to analyze new lineage evolution in near-real time using large numbers of genomes, we developed the Rapid Assessment of Selection within CLades (RASCL) pipeline.

Article Synopsis
  • In this case, an individual was superinfected with two SARS-CoV-2 variants, Alpha (B.1.1.7) and Epsilon (B.1.429), which led to unexpected genomic characteristics in the Alpha variant.
  • Full genome sequencing indicated that the Alpha variant made up about 75% of the viral presence, with the Epsilon variant at around 20%, and revealed multiple recombinant forms that could influence the virus's evolution.

Among the 30 nonsynonymous nucleotide substitutions in the Omicron S-gene are 13 that have only rarely been seen in other SARS-CoV-2 sequences. These mutations cluster within three functionally important regions of the S-gene at sites that will likely impact (1) interactions between subunits of the Spike trimer and the predisposition of subunits to shift from down to up configurations, (2) interactions of Spike with ACE2 receptors, and (3) the priming of Spike for membrane fusion. We show here that, based on both the rarity of these 13 mutations in intrapatient sequencing reads and patterns of selection at the codon sites where the mutations occur in SARS-CoV-2 and related sarbecoviruses, prior to the emergence of Omicron the mutations would have been predicted to decrease the fitness of any virus within which they occurred.

The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts.

An important component of efforts to manage the ongoing COVID-19 pandemic is the Rapid Assessment of how natural selection contributes to the emergence and proliferation of potentially dangerous SARS-CoV-2 lineages and CLades (RASCL). The RASCL pipeline enables continuous comparative phylogenetics-based selection analyses of rapidly growing clade-focused genome surveillance datasets, such as those produced following the initial detection of potentially dangerous variants. From such datasets RASCL automatically generates down-sampled codon alignments of individual genes/ORFs containing contextualizing background reference sequences, analyzes these with a battery of selection tests, and outputs results as both machine-readable JSON files and interactive notebook-based visualizations.

Among the 30 non-synonymous nucleotide substitutions in the Omicron S-gene are 13 that have only rarely been seen in other SARS-CoV-2 sequences. These mutations cluster within three functionally important regions of the S-gene at sites that will likely impact (i) interactions between subunits of the Spike trimer and the predisposition of subunits to shift from down to up configurations, (ii) interactions of Spike with ACE2 receptors, and (iii) the priming of Spike for membrane fusion. We show here that, based on both the rarity of these 13 mutations in intrapatient sequencing reads and patterns of selection at the codon sites where the mutations occur in SARS-CoV-2 and related sarbecoviruses, prior to the emergence of Omicron the mutations would have been predicted to decrease the fitness of any genome within which they occurred.

The programmed frameshift element (PFE) rerouting translation from ORF1a to ORF1b is essential for the propagation of coronaviruses. The combination of genomic features that make up the PFE (the overlap between the two reading frames, a slippery sequence, and an ensemble of complex secondary structure elements) places severe constraints on this region, as most possible nucleotide substitutions may disrupt one or more of these elements. The vast amount of SARS-CoV-2 sequencing data generated within the past year provides an opportunity to assess the evolutionary dynamics of the PFE in great detail.

The programmed frameshift element (PFE) rerouting translation from ORF1a to ORF1b is essential for propagation of coronaviruses. A combination of genomic features that make up the PFE (the overlap between the two reading frames, a slippery sequence, and an ensemble of complex secondary structure elements) puts severe constraints on this region, as most possible nucleotide substitutions may disrupt one or more of these elements. The vast amount of SARS-CoV-2 sequencing data generated within the past year provides an opportunity to assess evolutionary dynamics of the PFE in great detail.

Background: Significant progress has been made in advancing and standardizing tools for human genomic and biomedical research. Yet, the field of next-generation sequencing (NGS) analysis for microorganisms (including multiple pathogens) remains fragmented, lacks accessible and reusable tools, is hindered by local computational resource limitations, and does not offer widely accepted standards. One such "problem area" is the analysis of Transposon Insertion Sequencing (TIS) data.

The COVID-19 pandemic is shifting teaching to an online setting all over the world. The Galaxy framework facilitates the online learning process and makes it accessible by providing a library of high-quality community-curated training materials, enabling easy access to data and tools, and facilitating the sharing of achievements and progress between students and instructors. By combining Galaxy with robust communication channels, effective instruction can be designed inclusively, regardless of the students' environments.

Sequencing technology has achieved great advances in the past decade. Studies have previously shown the quality of specific instruments in controlled conditions. Here, we developed a method able to retroactively determine the error rate of most public sequencing datasets.

The COVID-19 pandemic is the first global health crisis to occur in the age of big genomic data. Although data generation capacity is well established and sufficiently standardized, analytical capacity is not. To establish analytical capacity, it is necessary to pull together global computational resources and deliver the best open source tools and analysis workflows within a ready-to-use, universally accessible resource.

Article Synopsis
  • Modern biology is increasingly reliant on computational methods to handle the large and complex datasets that are emerging, posing a challenge for experimental biologists who may lack computational skills.
  • Galaxy is a web-based platform that provides access to a variety of computational biology tools and public biological data repositories, allowing users to blend private and public datasets.
  • The article offers detailed protocols for using Galaxy to conduct specific biological analyses, including finding human coding exons, analyzing ChIP-seq data, comparing datasets, and working with RNA-seq.

Duplex sequencing is currently the most reliable method to identify ultra-low frequency DNA variants by grouping sequence reads derived from the same DNA molecule into families with information on the forward and reverse strand. However, only a small proportion of reads are assembled into duplex consensus sequences (DCS), and reads with potentially valuable information are discarded at different steps of the bioinformatics pipeline, especially reads without a family. We developed a bioinformatics toolset that analyses the tag and family composition with the aim of understanding data loss and implementing modifications to maximize the data output for variant calling.

Background: The vast ecosystem of single-cell RNA-sequencing tools has until recently been plagued by an excess of diverging analysis strategies, inconsistent file formats, and compatibility issues between different software suites. The uptake of 10x Genomics datasets has begun to calm this diversity, and the bioinformatics community leans once more towards the large computing requirements and the statistically driven methods needed to process and understand these ever-growing datasets.

Results: Here we outline several Galaxy workflows and learning resources for single-cell RNA-sequencing, with the aim of providing a comprehensive analysis environment paired with a thorough user learning experience that bridges the knowledge gap between the computational methods and the underlying cell biology.

The current state of much of the Wuhan pneumonia virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) research shows a regrettable lack of data sharing and considerable analytical obfuscation. This impedes global research cooperation, which is essential for tackling public health emergencies and requires unimpeded access to data, analysis tools, and computational infrastructure. Here, we show that community efforts in developing open analytical software tools over the past 10 years, combined with national investments into scientific computational infrastructure, can overcome these deficiencies and provide an accessible platform for tackling global health emergencies in an open and transparent manner.
