Objective: BLOSUM matrices serve as standard matrices for many protein sequence alignment programs. They were constructed using a BLOCKS version containing 27,102 BLOCKS, whereas the latest updated version has 6,739,916 BLOCKS. We read with interest the research article by Hess et al. (BMC Bioinform 17:189, 2016) on CorBLOSUM, which argues that an inaccuracy in the BLOSUM code affects the cluster memberships of sequences. They show that replacing the integer-based clustering threshold with a floating-point threshold arguably improves the performance of CorBLOSUM over the BLOSUM and RBLOSUM matrices. They compare BLOSUM62 against RBLOSUM69, with relative entropies of 0.2685 and 0.2662, respectively. The present work attempts to repeat the computation to verify the respective analog matrices.
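To illustrate how an integer-based threshold can change cluster membership, the following minimal sketch contrasts a truncating integer comparison with a floating-point one. It is not taken from the BLOSUM or CorBLOSUM source code: the function names are hypothetical, and the exact form of the comparison in the original code is an assumption made only for illustration.

```python
def identical_needed_int(width: int, threshold_pct: int) -> int:
    """Number of identical positions required, computed with integer
    arithmetic. Floor division discards the fractional part (assumed
    form of an integer-based threshold, for illustration only)."""
    return (width * threshold_pct) // 100


def clustered_int(matches: int, width: int, threshold_pct: int) -> bool:
    """Cluster decision using the truncated integer threshold."""
    return matches >= identical_needed_int(width, threshold_pct)


def clustered_float(matches: int, width: int, threshold_pct: float) -> bool:
    """Cluster decision using the fractional identity directly."""
    return matches / width >= threshold_pct / 100.0


if __name__ == "__main__":
    # Hypothetical pair of sequences: 27 identical positions in a block of width 45.
    matches, width, threshold = 27, 45, 62
    print("integer threshold:", clustered_int(matches, width, threshold))
    print("floating threshold:", clustered_float(matches, width, threshold))
```

In this toy case the pair is 60% identical, yet the truncated integer threshold (27 of 45 positions) admits it to a 62% cluster while the floating-point comparison does not. Differences of this kind are what can shift cluster memberships.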
Results: In our attempt to repeat the computation, we observed that the relative entropy of BLOSUM62 is 0.2360 and that of BLOSUM50 is 0.1198. Because only matrices of similar entropy can be compared, BLOSUM62 can be compared only with RBLOSUM66, and BLOSUM50 only with RBLOSUM56. We conducted experiments with Astral data sets and demonstrated improved accuracy in coverage. Our results imply that RBLOSUM performs statistically better than the CorBLOSUM and BLOSUM matrices.
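Since the comparison hinges on relative entropy, a minimal sketch of that quantity may help. It assumes the common definition H = Σ q_ij · log2(q_ij / (p_i · p_j)) over a symmetric table of observed pair frequencies q that sums to 1, with background frequencies p taken as the row sums. The function name and the two-letter toy table are hypothetical. The sketch is not the computation used in the paper and will not reproduce the reported values.

```python
import math
from typing import List


def relative_entropy(q: List[List[float]]) -> float:
    """Relative entropy in bits of observed pair frequencies q against
    the product of their marginals. q is assumed to be a symmetric
    matrix over ordered residue pairs whose entries sum to 1."""
    p = [sum(row) for row in q]  # background frequency of each residue
    h = 0.0
    for i, row in enumerate(q):
        for j, q_ij in enumerate(row):
            if q_ij > 0.0:  # zero-frequency pairs contribute nothing
                h += q_ij * math.log2(q_ij / (p[i] * p[j]))
    return h


if __name__ == "__main__":
    # Toy two-letter alphabet in which identical pairs are over-represented.
    toy = [[0.4, 0.1],
           [0.1, 0.4]]
    print(f"relative entropy: {relative_entropy(toy):.4f} bits")
```

A table whose pair frequencies equal the product of the background frequencies gives zero, and the value grows as the observed pairs depart from chance. This is why matrices are matched on entropy before their performance is compared.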
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5963171 | PMC |
| http://dx.doi.org/10.1186/s13104-018-3415-5 | DOI Listing |
Avicenna J Med Biotechnol
January 2024
Department of Molecular Virology, Pasteur Institute of Iran, Tehran, Iran.
Brief Bioinform
March 2024
Department of Computer Science, University of Western Ontario, London, N6A 5B7, Ontario, Canada.
Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent.
BMC Bioinformatics
February 2024
Luddy School of Informatics, Computing and Engineering, Indiana University, 700 N. Woodlawn Avenue, Bloomington, IN, 47408, USA.
Purpose: Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have used substitution matrices to generate alignments since the matrices were created in the 1970s; however, these matrices do not score alignments well within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity.
Biosystems
November 2023
Faculty of Kinesiology, University of Zagreb, Horvaćanski zavoj 15, HR-10000 Zagreb, Croatia.
Phylogenetics is the study of ancestral relationships among biological species. Such sequence analyses are often represented as phylogenetic trees. The branching pattern of each tree and its topology reflect the evolutionary relatedness between analyzed sequences.
Bioinformatics
June 2022
Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia.
Summary: Sequences of proteins evolve by accumulating substitutions together with insertions and deletions (indels) of amino acids. However, it remains a common practice to disconnect substitutions and indels, and infer approximate models for each of them separately, to quantify sequence relationships. Although this approach brings with it computational convenience (which remains its primary motivation), there is a dearth of attempts to unify and model them systematically and together.