Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition.

Prabina Kumar Meher, Tanmaya Kumar Sahu, Shachi Gahoi, Subhrajit Satpathy, Atmakuri Ramakrishna Rao

Gene

ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.

Published: July 2019

Identifying splice sites is essential for predicting gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than rule-based methods for identifying splice sites. However, nucleotide sequences are strings of letters and must be transformed into numeric features through sequence encoding before they can be used as input to MLAs. In this study, we evaluated the performance of 8 sequence encoding schemes, i.e., Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), a combination of 1st and 2nd order Markov models (MM1 + MM2) and 2nd order Markov model (MM2), for predicting donor and acceptor splice sites using 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species, i.e., A. thaliana, C. elegans, D. melanogaster and H. sapiens, and the performances were then validated in another 4 species, i.e., Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei. In terms of ROC (receiver operating characteristic) and PR (precision-recall) curves, the FDTF encoding scheme achieved the highest accuracy, followed by either MM2 or FDDM. Similarly, SVM achieved the highest accuracy (in terms of ROC and PR curves), followed by RF, across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination outperformed the other combinations of classifiers and encoding schemes. Further, splice site prediction accuracies were higher for species with low intron density. To the best of our knowledge, this is the first comprehensive evaluation of sequence encoding schemes for splice site prediction.
We have also developed an R package, EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html), for encoding splice site motifs with different encoding schemes, which is expected to supplement existing nucleotide sequence encoding approaches. This study should be useful to computational biologists for predicting different functional elements in genomic DNA.
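To make the idea of sequence encoding concrete, the following is a minimal Python sketch of two generic encodings of the kind named in the abstract: sparse (one-hot) encoding and a position-specific 1st order Markov model. It is an illustration under stated assumptions and is not the EncDNA implementation or the exact formulation used in the study. The function names (`sparse_encode`, `fit_mm1`, `mm1_encode`), the A/C/G/T ordering, the add-one pseudocount, and the toy sequences are all hypothetical choices made here.

```python
from collections import defaultdict

NUCLEOTIDES = "ACGT"  # assumed alphabet and ordering, not taken from the paper


def sparse_encode(seq):
    """Sparse (one-hot) encoding: each base becomes four 0/1 values."""
    vector = []
    for base in seq.upper():
        vector.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return vector


def fit_mm1(true_sites, pseudocount=1.0):
    """Estimate position-specific P(base at i | base at i-1).

    `true_sites` is a list of aligned, equal-length sequences containing
    only A, C, G and T. The pseudocount is an assumption of this sketch.
    """
    length = len(true_sites[0])
    counts = [defaultdict(float) for _ in range(length)]
    for seq in true_sites:
        seq = seq.upper()
        for i in range(1, length):
            counts[i][(seq[i - 1], seq[i])] += 1.0
    probs = []
    for i in range(length):
        table = {}
        for prev in NUCLEOTIDES:
            total = sum(counts[i][(prev, cur)] for cur in NUCLEOTIDES)
            total += len(NUCLEOTIDES) * pseudocount
            for cur in NUCLEOTIDES:
                table[(prev, cur)] = (counts[i][(prev, cur)] + pseudocount) / total
        probs.append(table)
    return probs


def mm1_encode(seq, probs):
    """Replace positions 1..L-1 with their conditional probabilities."""
    seq = seq.upper()
    return [probs[i][(seq[i - 1], seq[i])] for i in range(1, len(seq))]


if __name__ == "__main__":
    print(sparse_encode("GT"))  # [0, 0, 1, 0, 0, 0, 0, 1]
    model = fit_mm1(["AGGT", "AGGT", "CGGT"])  # toy "true site" windows
    print(mm1_encode("AGGT", model))
```

Either numeric vector could then be passed to a classifier such as an SVM. In this sketch the Markov encoding is fitted on true sites only, which is one of several possible designs.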

DOI: http://dx.doi.org/10.1016/j.gene.2019.04.047
