Motivation: Multiple sequence alignments are extensively used in biology, from phylogenetic reconstruction to structure and function prediction. Here, we suggest an out-of-the-box approach for the inference of multiple sequence alignments, which relies on algorithms developed for processing natural languages. We show that our AI-based methodology can be trained to align sequences by processing alignments that are generated via simulations, and thus different aligners can be easily generated for datasets with specific evolutionary dynamics attributes. We expect that natural-language processing solutions will replace or augment classic solutions for computing alignments, and more generally, challenging inference tasks in phylogenomics.
Results: The multiple sequence alignment (MSA) problem is a fundamental pillar in bioinformatics, comparative genomics, and phylogenetics. Here we characterize and improve BetaAlign, the first deep learning aligner, which substantially deviates from conventional algorithms of alignment computation. BetaAlign draws on natural language processing (NLP) techniques and trains transformers to map a set of unaligned biological sequences to an MSA. We show that our approach is highly accurate, comparable and sometimes better than state-of-the-art alignment tools. We characterize the performance of BetaAlign and the effect of various aspects on accuracy; for example, the size of the training data, the effect of different transformer architectures, and the effect of learning on a subspace of indel-model parameters (subspace learning). We also introduce a new technique that leads to improved performance compared to our previous approach. Our findings further uncover the potential of NLP-based methods for sequence alignment, highlighting that AI-based algorithms can substantially challenge classic approaches in phylogenomics and bioinformatics.
Availability: Datasets used in this work are available on HuggingFace (Wolf et al., 2020) at: https://huggingface.co/dotan1111. Source code is available at: https://github.com/idotan286/SimulateAlignments.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1093/bioinformatics/btaf009 | DOI Listing |
Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!