N4-methylcytosine (4 mC) is an important and common methylation which widely exists in prokaryotes. It plays a crucial role in correcting DNA replication errors and protecting host DNA against degradation by restrictive enzymes. Hence, the accurate identification for 4 mC sites is greatly significant for understanding biological functions and treating gene diseases. In this paper, a novel model is designed for identifying 4 mC sites. Firstly, we extract features from original sequences by multi-source feature representation methods, which are mono-nucleotide binary and k-mer frequency, dinucleotide binary and position-specific frequency, ring-function-hydrogen-chemical properties, dinucleotide-based DNA properties and trinucleotide-based DNA properties. Subsequently, gradient boosting decision tree is applied to select the optimal feature set and remove redundant information. Finally, support vector machine is employed to predict 4 mC or non-4mC sites. The accuracies of six datasets reach 0.851, 0.859, 0.801, 0.87, 0.859 and 0.901, respectively, which are superior to previous prediction methods. Therefore, the results show that our predictor is a feasible and effective tool for identifying 4 mC sites. Furthermore, an online web server is established at http://dnan4c.zhanglab.site.
Download full-text PDF |
Source |
---|---|
http://dx.doi.org/10.1016/j.ab.2022.114746 | DOI Listing |
Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!