A novel alignment-free vector method to cluster protein sequences.

J Theor Biol

Department of Mathematical Sciences, Tsinghua University, Beijing 100084, PR China.

Published: August 2017

Classification of proteins is a crucial topic in biology. The number of protein sequences stored in databases has increased sharply over the past decade. Traditionally, protein sequences are compared through multiple sequence alignment methods. However, these methods may be unsuitable for clustering protein sequences when gene rearrangements occur, as in viral genomes. The computation is also very time-consuming for large datasets with long genomes. In this paper, based on three important biochemical properties of amino acids (the hydropathy index, the polar requirement, and the chemical composition of the side chain), we propose a 24-dimensional feature vector describing the composition of amino acids in protein sequences. Our method not only uses the chemical properties of amino acids but also accounts for their numbers and positions. The results on beta-globin, mammal, and three virus datasets show that this new tool is fast and accurate for classifying proteins and inferring the phylogeny of organisms.
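
The abstract does not spell out how the feature vector is built, so the following is only a minimal sketch of the general idea of combining chemical grouping with counts and positions. It assumes a natural-vector-style summary (count, mean position, and a scaled second central moment per group) and a two-group split by the sign of the Kyte-Doolittle hydropathy index. The function name `group_features`, the `GROUPS` table, and the grouping itself are hypothetical illustrations, not the authors' 24-dimensional construction.

```python
from typing import Dict, List, Set

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD: Dict[str, float] = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9,
    "R": -4.5,
}

# ASSUMPTION: an illustrative two-group split by the sign of the index.
# The paper's actual grouping is not given in the abstract.
GROUPS: Dict[str, Set[str]] = {
    "hydrophobic": {aa for aa, v in KD.items() if v > 0},
    "hydrophilic": {aa for aa, v in KD.items() if v <= 0},
}


def group_features(seq: str, groups: Dict[str, Set[str]]) -> List[float]:
    """Return (count, mean position, scaled second moment) for each group.

    Positions are 1-based. The second central moment is divided by
    count * len(seq), as in natural-vector-style descriptors.
    Residues outside every group are ignored.
    """
    n = len(seq)
    features: List[float] = []
    for members in groups.values():
        positions = [i + 1 for i, aa in enumerate(seq) if aa in members]
        count = len(positions)
        if count == 0:
            # An absent group contributes zeros so the length stays fixed.
            features.extend([0.0, 0.0, 0.0])
            continue
        mean = sum(positions) / count
        moment = sum((p - mean) ** 2 for p in positions) / (count * n)
        features.extend([float(count), mean, moment])
    return features


if __name__ == "__main__":
    print(group_features("AKAK", GROUPS))
```

Because every sequence maps to a vector of the same length regardless of its own length, two proteins can be compared with an ordinary vector distance, with no alignment step.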

Source
http://dx.doi.org/10.1016/j.jtbi.2017.06.002
