Inaccurate inference of positional homologies in multiple sequence alignments and systematic errors introduced by alignment heuristics obfuscate phylogenetic inference. Alignment masking, the elimination of phylogenetically uninformative or misleading sites from an alignment before phylogenetic analysis, is a common practice. Although masking is often done manually, automated methods are necessary to handle the much larger data sets being prepared today. In this study, we introduce the concept of subsplits and demonstrate their use in extracting phylogenetic signal from alignments. We design a clustering approach for alignment masking in which each cluster contains similar columns, with similarity defined on the basis of compatible subsplits; our approach then identifies noisy clusters and eliminates them. Trees inferred from the columns in the retained clusters are found to be topologically closer to the reference trees. We test our method on numerous standard benchmarks (both synthetic and biological data sets) and compare its performance with other methods of alignment masking. We find that our method can eliminate sites more accurately than other methods, particularly on divergent data, and can improve the topologies of the inferred trees in likelihood-based analyses. Software is available upon request from the author.
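
The abstract does not define subsplits or the clustering procedure, so the following Python sketch is only an illustration of the general idea of column compatibility, not the authors' method. It assumes that each character state in a column induces a bipartition of the non-gap taxa, and it substitutes the classical split-compatibility test (two bipartitions are compatible when at least one of their four pairwise intersections is empty). The function names `column_splits`, `compatible`, and `columns_compatible` are hypothetical.

```python
def column_splits(column, gap="-"):
    """Nontrivial bipartitions of taxa induced by one alignment column.

    Assumption: a state's split is (taxa with that state | other non-gap taxa).
    Taxa are identified by their row index in the column.
    """
    taxa = frozenset(i for i, ch in enumerate(column) if ch != gap)
    splits = set()
    for state in set(column) - {gap}:
        side = frozenset(i for i in taxa if column[i] == state)
        other = taxa - side
        # Splits with a side of fewer than two taxa carry no topological signal.
        if len(side) >= 2 and len(other) >= 2:
            splits.add(frozenset((side, other)))
    return splits


def compatible(split1, split2):
    """Classical test: some pairwise intersection of the sides is empty."""
    a, b = tuple(split1)
    c, d = tuple(split2)
    # Restrict both splits to the taxa they share (gaps may differ by column).
    common = (a | b) & (c | d)
    a, b, c, d = (x & common for x in (a, b, c, d))
    return any(not (x & y) for x in (a, b) for y in (c, d))


def columns_compatible(col1, col2):
    """Two columns are compatible if all of their induced splits are."""
    return all(
        compatible(s, t)
        for s in column_splits(col1)
        for t in column_splits(col2)
    )


if __name__ == "__main__":
    print(columns_compatible("AAGG", "CCTT"))  # both group taxa {0,1} vs {2,3}
    print(columns_compatible("AAGG", "ACAC"))  # conflicting groupings
```

A pairwise relation of this kind could serve as the similarity measure for grouping columns, but how the clusters are formed and how noisy clusters are recognized is specific to the paper and is not reproduced here.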

Source
http://dx.doi.org/10.1093/molbev/mss264
