Summary: Estimating genome size using k-mer frequencies, which plays a fundamental role in designing genome sequencing and analysis projects, has remained challenging for polyploid species, i.e., ploidy p > 2.
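As background for the summary above, the basic k-mer idea can be sketched for the simplest case: a single dominant peak in the k-mer spectrum, as in a haploid or highly homozygous genome. This is not the polyploid method the summary refers to. The function names and the `min_depth` error cutoff are illustrative assumptions, and the estimate is in k-mers, which is close to the size in bases when the genome is much longer than k.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (length-k substring) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(spectrum, min_depth=2):
    """Estimate genome size as (total k-mers) / (depth of the main spectrum peak).

    Multiplicities below min_depth are treated as sequencing errors and ignored
    (an assumed, simplistic error filter).
    """
    usable = {d: n for d, n in spectrum.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Peak depth: the multiplicity shared by the largest number of distinct k-mers
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth
```

With more than two haplotypes the spectrum has several overlapping peaks, so picking a single peak depth as above no longer works directly, which is consistent with the difficulty the summary describes.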
Noncoding RNAs (ncRNAs), including long noncoding RNAs (lncRNAs) and microRNAs (miRNAs), play crucial roles in gene expression regulation and are significant in disease associations and medical research. Accurate ncRNA-disease association prediction is essential for understanding disease mechanisms and developing treatments. Existing methods often focus on single tasks like lncRNA-disease associations (LDAs), miRNA-disease associations (MDAs), or lncRNA-miRNA interactions (LMIs), and fail to exploit heterogeneous graph characteristics.
Introduction: Because Alzheimer's disease (AD) has significant heterogeneity in encephalatrophy and clinical manifestations, AD research faces two critical challenges: eliminating the impact of natural aging and extracting valuable clinical data for patients with AD.
Methods: This study attempted to address these challenges by developing a novel machine-learning model called tensorized contrastive principal component analysis (T-cPCA). The objectives of this study were to predict AD progression and identify clinical subtypes while minimizing the influence of natural aging.
IEEE/ACM Trans Comput Biol Bioinform
August 2024
CircRNA is closely related to human disease, so it is important to predict circRNA-disease associations (CDAs). However, traditional biological detection methods are difficult to perform and have low accuracy, and computational methods represented by deep learning overlook the model's ability to explicitly extract local depth information of the CDA. We propose a model based on knowledge graph from recursion and attention aggregation for circRNA-disease association prediction (KGRACDA).
IEEE J Biomed Health Inform
November 2023
Non-coding RNAs (ncRNAs) are a class of RNA molecules that lack the ability to encode proteins in human cells but play crucial roles in various biological processes. Understanding the interactions between different ncRNAs and their impact on diseases can significantly contribute to the diagnosis, prevention, and treatment of diseases. However, predicting tertiary interactions between ncRNAs and diseases based on structural information at multiple scales remains a challenging task.
Recent studies have demonstrated the significant role that circRNA plays in the progression of human diseases. Identifying circRNA-disease associations (CDA) in an efficient manner can offer crucial insights into disease diagnosis. While traditional biological experiments can be time-consuming and labor-intensive, computational methods have emerged as a viable alternative in recent years.
IEEE Trans Neural Netw Learn Syst
September 2024
General graph neural networks (GNNs) implement convolution operations on graphs based on polynomial spectral filters. Existing filters with high-order polynomial approximations can detect more structural information when reaching high-order neighborhoods but produce indistinguishable representations of nodes, which indicates their inefficiency in processing information from high-order neighborhoods and results in performance degradation. In this article, we theoretically identify the feasibility of avoiding this problem and attribute it to overfitting polynomial coefficients.
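To make the term concrete, a polynomial spectral filter applies a weighted sum of powers of a graph matrix to a node signal. The sketch below assumes the plain monomial basis and the combinatorial Laplacian L = D - A. The abstract does not say which basis or matrix the article uses, and the function names are illustrative.

```python
def mat_vec(matrix, vec):
    """Multiply a dense square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def polynomial_filter(laplacian, signal, coeffs):
    """Return sum_k coeffs[k] * L^k x, a polynomial filter of order len(coeffs) - 1."""
    out = [0.0] * len(signal)
    power = list(signal)  # L^0 x
    for theta in coeffs:
        out = [o + theta * p for o, p in zip(out, power)]
        power = mat_vec(laplacian, power)  # advance to the next power of L applied to x
    return out


# Path graph 0 - 1 - 2 with combinatorial Laplacian L = D - A
L = [[1, -1, 0],
     [-1, 2, -1],
     [0, -1, 1]]
x = [1.0, 0.0, 0.0]  # signal concentrated on node 0
first_order = polynomial_filter(L, x, [1.0, 0.5])
second_order = polynomial_filter(L, x, [0.0, 0.0, 1.0])
```

In this example the first-order filter leaves node 2 at zero, while the second-order term reaches it, which illustrates how the polynomial order controls the neighborhood radius the filter can see.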
IEEE J Biomed Health Inform
February 2023
Long non-coding RNAs (lncRNAs) serve a vital role in regulating gene expression and other biological processes. Differentiating lncRNAs from protein-coding transcripts helps researchers investigate the mechanism of lncRNA formation and its downstream regulation in various diseases. Previous approaches to identifying lncRNAs include traditional bio-sequencing and machine learning methods.
Nucleic Acids Res
November 2022
Multimodal single-cell sequencing technologies provide unprecedented information on cellular heterogeneity from multiple layers of genomic readouts. However, joint analysis of two modalities without properly handling the noise often leads to overfitting of one modality by the other and worse clustering results than vanilla single-modality analysis. How to efficiently utilize the extra information from single-cell multi-omics to delineate cell states and identify meaningful signals remains a significant computational challenge.
Topologically associating domains (TADs) are fundamental building blocks of the three-dimensional genome and are organized into complex hierarchies. Identifying hierarchical TADs from Hi-C data helps to understand the relationship between genome architecture and gene regulation. Herein we propose TADfit, a multivariate linear regression model for profiling hierarchical chromatin domains. It tries to fit the interaction frequencies in the Hi-C contact matrix, with or without replicates, using all possible hierarchical TADs; the significant ones are determined by the regression coefficients obtained with the help of an online learning solver called Follow-The-Regularized-Leader (FTRL).
BMC Bioinformatics
May 2022
Background: The zone adjacent to a transcription start site (TSS), namely, the promoter, is primarily involved in the process of DNA transcription initiation and regulation. As a result, proper promoter identification is critical for further understanding the mechanism of the networks controlling genomic regulation. A number of methodologies for the identification of promoters have been proposed.
J Bioinform Comput Biol
February 2022
The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity.
Nucleic Acids Res
February 2022
For many RNA molecules, the secondary structure is essential for the correct function of the RNA. Predicting RNA secondary structure from nucleotide sequences is a long-standing problem in genomics, but the prediction performance has reached a plateau over time. Traditional RNA secondary structure prediction algorithms are primarily based on thermodynamic models through free energy minimization, which imposes strong prior assumptions and is slow to run.
With the rapid development of the Internet, readers tend to share their views and emotions about news events. Predicting these emotions plays a vital role in social media applications (e.g.
Comput Math Methods Med
November 2021
High-throughput data make it possible to study expression levels of thousands of genes simultaneously under a particular condition. However, only a few of the genes are discriminatively expressed. Identifying these biomarkers precisely is significant for disease diagnosis, prognosis, and therapy.
Motivation: Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) provides new opportunities to dissect epigenomic heterogeneity and elucidate transcriptional regulatory mechanisms. However, computational modeling of scATAC-seq data is challenging due to its high dimension, extreme sparsity, complex dependencies and high sensitivity to confounding factors from various sources.
Results: Here, we propose a new deep generative model framework, named SAILER, for analyzing scATAC-seq data.
Characterizing genome-wide binding profiles of transcription factors (TFs) is essential for understanding biological processes. Although techniques have been developed to assess binding profiles within a population of cells, determining them at a single-cell level remains elusive. Here, we report scFAN (single-cell factor analysis network), a deep learning model that predicts genome-wide TF binding profiles in individual cells.
Motivation: Identifying cis-acting genetic variants associated with gene expression levels, an analysis commonly referred to as expression quantitative trait loci (eQTL) mapping, is an important first step toward understanding the genetic determinants of gene expression variation. Successful eQTL mapping requires effective control of confounding factors. A common method for controlling confounding effects in eQTL mapping studies is the probabilistic estimation of expression residual (PEER) analysis.
Identifying genetic variants that are associated with methylation variation, an analysis commonly referred to as methylation quantitative trait locus (mQTL) mapping, is important for understanding the epigenetic mechanisms underlying genotype-trait associations. Here, we develop a statistical method, IMAGE, for mQTL mapping in sequencing-based methylation studies. IMAGE properly accounts for the count nature of bisulfite sequencing data and incorporates allele-specific methylation patterns from heterozygous individuals to enable more powerful mQTL discovery.
IEEE/ACM Trans Comput Biol Bioinform
December 2021
DNA methylation plays an important role in the regulation of some biological processes. With the development of machine learning, several sequence-based deep learning models have been designed to predict DNA methylation states, and they perform better than traditional methods such as random forest and SVM. However, convolutional-network-based deep learning models that take one-hot encoded DNA sequences as input may capture only limited information and give unsatisfactory prediction performance, so more data and model structures from diverse angles should be considered.
Cumulative evidence from biological experiments has confirmed that microRNAs (miRNAs) are related to many types of human diseases through different biological processes. It is anticipated that precise miRNA-disease association prediction could not only help infer potential disease-related miRNA but also boost human diagnosis and disease prevention. Considering the limitations of previous computational models, a more effective computational model needs to be implemented to predict miRNA-disease associations.
Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.
Background: Identifying transcription factor binding sites (TFBSs) plays an important role in understanding gene regulatory processes. The underlying mechanism of the specific binding of transcription factors (TFs) is still poorly understood. Previous machine learning-based approaches to identifying TFBSs commonly map a known TFBS to a one-dimensional vector using its physicochemical properties.
Comput Math Methods Med
March 2017
Gene regulatory networks (GRNs) play an important role in cellular systems and are important for understanding biological processes. Many algorithms have been developed to infer the GRNs. However, most algorithms only pay attention to the gene expression data but do not consider the topology information in their inference process, while incorporating this information can partially compensate for the lack of reliable expression data.
Studies of the association between diseases and informative single nucleotide polymorphisms (SNPs) have received great attention. However, most of them just use the whole set of useful SNPs and fail to consider SNP-SNP interactions, even though these interactions have already been demonstrated in biological experiments. In this paper, we use a binary particle swarm optimization with hierarchical structure (BPSOHS) algorithm to improve the effectiveness of PSO for the identification of SNP-SNP interactions.