Sequence-based prediction of protein-binding sites in DNA: comparative study of two SVM models.

Comput Methods Programs Biomed

Department of Computer Science and Engineering, Inha University, Incheon, South Korea.

Published: November 2014

As many structures of protein-DNA complexes have become available in recent years, several computational methods have been developed to predict DNA-binding sites in proteins. However, the inverse problem (i.e., predicting protein-binding sites in DNA) has received much less attention. One reason is that the differences between the interaction propensities of nucleotides are much smaller than those between amino acids. Another is that DNA exhibits less diverse sequence patterns than protein. Therefore, predicting protein-binding DNA nucleotides is much harder than predicting DNA-binding amino acids. We computed the interaction propensity (IP) of nucleotide triplets with amino acids using an extensive dataset of protein-DNA complexes, and developed two support vector machine (SVM) models that predict protein-binding nucleotides from sequence data alone. The first SVM model predicts protein-binding nucleotides using DNA sequence data alone, and the second uses both DNA and protein sequences. In a 10-fold cross-validation with 1519 DNA sequences, the DNA-only model predicted protein-binding nucleotides with an accuracy of 67.0%, an F-measure of 67.1%, and a Matthews correlation coefficient (MCC) of 0.340. With an independent dataset of 181 DNAs that were not used in training, it achieved an accuracy of 66.2%, an F-measure of 66.3%, and an MCC of 0.324. The second model, which uses both DNA and protein sequences, achieved an accuracy of 69.6%, an F-measure of 69.6%, and an MCC of 0.383 in a 10-fold cross-validation with 1519 DNA sequences and 859 protein sequences. With an independent dataset of 181 DNAs and 143 proteins, it showed an accuracy of 67.3%, an F-measure of 66.5%, and an MCC of 0.329. In both cross-validation and independent testing, the second SVM model, which used both DNA and protein sequence data, performed better than the first model, which used DNA sequence data alone.
To the best of our knowledge, this is the first attempt to predict protein-binding nucleotides in a given DNA sequence from the sequence data alone.
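
The three reported measures (accuracy, F-measure, and MCC) can all be derived from a binary confusion matrix over nucleotides labeled binding or non-binding. The following is a minimal sketch of those standard definitions, not the authors' evaluation code. The function name `binary_metrics` and the confusion-matrix counts are hypothetical, chosen only to give round numbers; they are not taken from the paper's data.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, F-measure, MCC) for a binary confusion matrix.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives (e.g., per nucleotide).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total

    # Precision and recall, guarding against empty denominators.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # F-measure: harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return accuracy, f_measure, mcc


# Hypothetical, perfectly balanced counts (illustration only).
acc, f1, mcc = binary_metrics(tp=67, fp=33, tn=67, fn=33)
print(f"accuracy={acc:.3f}  F-measure={f1:.3f}  MCC={mcc:.3f}")
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level prediction, so it is read on a different scale from the two percentage measures: a value near 0.34 can accompany an accuracy near 67%, as in this balanced example.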

Source
http://dx.doi.org/10.1016/j.cmpb.2014.07.009
