Identifying splice sites is essential for predicting gene structure. Machine learning-based approaches (MLAs) have been reported to identify splice sites more successfully than rule-based methods. However, nucleotide sequences must first be transformed into numeric features through sequence encoding before they can be used as input to MLAs. In this study, we evaluated the performance of 8 sequence encoding schemes, namely Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), combination of 1st and 2nd order Markov models (MM1 + MM2), and 2nd order Markov model (MM2), for predicting donor and acceptor splice sites using 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species (A. thaliana, C. elegans, D. melanogaster and H. sapiens), and their performance was then validated in another 4 species (Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei). In terms of ROC (receiver operating characteristic) and PR (precision-recall) curves, the FDTF encoding achieved the highest accuracy, followed by either MM2 or FDDM. Among the classifiers, SVM achieved the highest accuracy (in terms of ROC and PR curves), followed by RF, across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination outperformed the other combinations of classifiers and encoding schemes. Further, splice site prediction accuracies were higher for species with low intron density. To the best of our knowledge, this is the first comprehensive evaluation of sequence encoding schemes for the prediction of splice sites.
We have also developed an R package, EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html), for encoding splice site motifs with different encoding schemes, which is expected to supplement existing nucleotide sequence encoding approaches. This study should be useful to computational biologists predicting different functional elements in genomic DNA.
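
To make the idea of sequence encoding concrete, the following Python sketch shows two simplified encoders in the spirit of the schemes named above: a sparse (one-hot) encoding and a position-specific 1st order Markov encoding. It is an illustration under stated assumptions, not the EncDNA implementation or the exact formulation used in the study. The function names (`sparse_encode`, `fit_markov1`, `markov1_encode`), the pseudocount smoothing, and the restriction to the alphabet A, C, G, T are all hypothetical choices made for this example.

```python
from collections import defaultdict

NUCLEOTIDES = "ACGT"  # assumption: sequences contain only these four symbols


def sparse_encode(seq):
    """One-hot (sparse) encoding: each nucleotide becomes four 0/1 indicators."""
    features = []
    for base in seq.upper():
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


def fit_markov1(train_seqs, pseudocount=1.0):
    """Estimate position-specific transition probabilities P(x_i | x_{i-1})
    from aligned, equal-length training sequences (e.g. true splice sites).
    The pseudocount is an illustrative smoothing choice, not from the paper."""
    length = len(train_seqs[0])
    counts = [defaultdict(float) for _ in range(length - 1)]
    for seq in train_seqs:
        seq = seq.upper()
        for i in range(1, length):
            counts[i - 1][(seq[i - 1], seq[i])] += 1.0

    probs = []
    for pos_counts in counts:
        table = {}
        for prev in NUCLEOTIDES:
            total = sum(pos_counts[(prev, cur)] for cur in NUCLEOTIDES)
            total += len(NUCLEOTIDES) * pseudocount
            for cur in NUCLEOTIDES:
                table[(prev, cur)] = (pos_counts[(prev, cur)] + pseudocount) / total
        probs.append(table)
    return probs


def markov1_encode(seq, probs):
    """Encode a sequence as the vector of its transition probabilities,
    one numeric feature per adjacent pair of positions."""
    seq = seq.upper()
    return [probs[i - 1][(seq[i - 1], seq[i])] for i in range(1, len(seq))]


if __name__ == "__main__":
    true_sites = ["AGGT", "AGGT", "CGGT", "AGGA"]  # toy data, not real splice sites
    model = fit_markov1(true_sites)
    print(sparse_encode("AGGT"))
    print(markov1_encode("AGGT", model))
```

Either feature vector can then be passed to a supervised classifier such as an SVM or random forest. The Markov encoder needs a training set to estimate its probabilities, whereas the sparse encoder is fixed in advance; that difference is one reason the choice of encoding scheme can affect prediction accuracy.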


Source
DOI: http://dx.doi.org/10.1016/j.gene.2019.04.047

