Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from an inherent limitation: if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training-set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale, genome-wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that, compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue-specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
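The sparsity problem described in the abstract can be illustrated with a small sketch. The code below is not the gkm-SVM implementation and does not use the tree data structure mentioned above; it is a naive enumeration, written under the assumption that a gapped k-mer is a window of length `l` in which `k` positions are kept and the remaining `l - k` positions are treated as wildcards (shown here as `-`). The function names, the toy sequence, and the parameter values are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mers by brute force (illustrative assumption).

    Each window of length l contributes one pattern per choice of k
    informative positions; the other l - k positions become '-'.
    """
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        for keep in combinations(range(l), k):
            kept = set(keep)
            pattern = "".join(
                c if j in kept else "-" for j, c in enumerate(window)
            )
            counts[pattern] += 1
    return counts


# Toy sequence (hypothetical), not taken from any dataset in the paper.
seq = "ACGTACGTTGCA"
full = kmer_counts(seq, 6)            # contiguous 6-mers
gapped = gapped_kmer_counts(seq, 6, 4)  # length-6 windows, 4 informative positions

print("distinct 6-mers:", len(full), "max count:", max(full.values()))
print("count of 'ACGT--':", gapped["ACGT--"])
```

In this toy sequence every contiguous 6-mer occurs only once, so the full k-mer counts are effectively binary, while the gapped pattern `ACGT--` is shared by two different windows. Pooling evidence across windows in this way is the intuition behind using gapped features when k is large; the brute-force enumeration grows quickly with `l` and is only suitable for small examples.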
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4102394 | PMC |
| http://dx.doi.org/10.1371/journal.pcbi.1003711 | DOI Listing |