Enhanced regulatory sequence prediction using gapped k-mer features.

PLoS Comput Biol

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland, United States of America.

Published: July 2014

Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome-wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue-specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a naïve Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
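To make the idea of gapped k-mer features concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: it enumerates features by brute force rather than with the tree data structure the abstract mentions, and the names `gapped_kmer_counts`, `kernel`, `normalized_kernel`, the window length `l`, and the gap symbol `N` are all hypothetical choices made for this example. A gapped k-mer is taken here to be a window of length `l` in which `k` positions are kept and the remaining `l - k` are masked.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

GAP = "N"  # hypothetical symbol marking a masked (gap) position


def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mers: every length-l window with k positions kept
    and the other l - k positions replaced by GAP (brute-force enumeration)."""
    counts = Counter()
    masks = [set(keep) for keep in combinations(range(l), k)]
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for keep in masks:
            feature = "".join(c if i in keep else GAP
                              for i, c in enumerate(window))
            counts[feature] += 1
    return counts


def kernel(a, b):
    """Inner product of two feature-count vectors (Counter gives 0 if absent)."""
    return sum(v * b[f] for f, v in a.items())


def normalized_kernel(seq_a, seq_b, l, k):
    """Cosine-style similarity between two sequences in gapped k-mer space."""
    a = gapped_kmer_counts(seq_a, l, k)
    b = gapped_kmer_counts(seq_b, l, k)
    denom = sqrt(kernel(a, a) * kernel(b, b))
    return kernel(a, b) / denom if denom else 0.0


if __name__ == "__main__":
    # Two 4-mers that differ at a single position.
    print(normalized_kernel("ACGT", "ACCT", l=4, k=4))  # ungapped 4-mers
    print(normalized_kernel("ACGT", "ACCT", l=4, k=3))  # one gap allowed
```

With ungapped 4-mers (`k = l = 4`) the sequences `ACGT` and `ACCT` share no feature, so their similarity is zero. Allowing one gap (`k = 3`) lets them share the feature `ACNT`, which is the kind of partial match that keeps feature counts informative when the window length grows.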

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4102394
DOI: http://dx.doi.org/10.1371/journal.pcbi.1003711

Publication Analysis

Top Keywords

regulatory sequence: 8
k-mer frequencies: 8
k-mer: 5
k-mers: 5
enhanced regulatory: 4
sequence prediction: 4
prediction gapped: 4
gapped k-mer: 4
features: 4
k-mer features: 4

Similar Publications

RNA interference (RNAi) mediates antiviral defense in many eukaryotes. Caenorhabditis elegans mutants that disable RNAi are more sensitive to viral infection. Many mutants that enhance RNAi have also been identified; these mutations may reveal genes that are normally down-regulated in antiviral defense.

Many machine learning techniques have been used to construct gene regulatory networks (GRNs) through a precision matrix that captures conditional independence among genes and ultimately yields a sparse GRN. This construction can be improved using auxiliary information such as the gene expression profiles of related species or gene markers. To reach this goal, we first apply a generalized linear model (GLM) and then a penalized maximum likelihood, using the Glasso technique, to construct the gene regulatory network from the residuals of a multi-level multivariate GLM, with the gene expression of one species as a multi-level response variable and the gene expression of related species as multivariate covariates.

Multi-omics analysis reveals distinct gene regulatory mechanisms between primary and organoid-derived human hepatocytes.

Dis Model Mech

January 2025

Department of Molecular Biology, Faculty of Science, Radboud Institute for Molecular Life Science, Radboud University, Nijmegen 6525GA, The Netherlands.

Hepatic organoid cultures are a powerful model to study liver development and diseases in vitro. However, hepatocyte-like cells differentiated from these organoids remain immature compared to primary human hepatocytes (PHHs), which are the benchmark in the field. Here, we applied integrative single-cell transcriptome and chromatin accessibility analysis to reveal gene regulatory mechanisms underlying these differences.

Interventions to reduce non-prescription antimicrobial sales in community pharmacies.

Cochrane Database Syst Rev

January 2025

Global Health Nursing, Graduate School of Nursing Science, St. Luke's International University, Chuo-ku, Japan.

Background: Antimicrobial resistance (AMR) is a major global health concern. One of the most important causes of AMR is the excessive and inappropriate use of antimicrobial drugs in healthcare and community settings. Most countries have policies that require antimicrobial drugs to be obtained from a pharmacy by prescription.

Molecular and functional convergences associated with complex multicellularity in Eukarya.

Mol Biol Evol

January 2025

Laboratório de Algoritmos em Biologia, Departamento de Genética, Ecologia e Evolução, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Brazil.

A key trait of Eukarya is the independent evolution of complex multicellularity (CM) in animals, plants, fungi, brown algae and red algae. This phenotype is characterized by the initial exaptation of cell-cell adhesion genes followed by the emergence of mechanisms for cell-cell communication, together with the expansion of transcription factor gene families responsible for cell and tissue identity. The number of cell types (NCT) is commonly used as a quantitative proxy for biological complexity in comparative genomics studies.
