Finding the causal structure from a set of variables given observational data is a crucial task in many scientific areas. Most algorithms focus on discovering the global causal graph but few efforts have been made toward the local causal structure (LCS), which is of wide practical significance and easier to obtain. LCS learning faces the challenges of neighborhood determination and edge orientation.
Motivation: High-resolution annotation of gene functions is a central task in functional genomics. Multiple proteoforms translated from alternatively spliced isoforms from a single gene are actual function performers and greatly increase the functional diversity. The specific functions of different isoforms can decipher the molecular basis of various complex diseases at a finer granularity.
Personalized federated learning (PFL) learns a personalized model for each client in a decentralized manner, where each client owns private data that are not shared and data among clients are non-independent and identically distributed (i.i.d.
Motivation: Alternative splicing creates considerable proteomic diversity and complexity from a relatively limited genome. Proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions of this gene, and they reflect the functional knowledge of genes at a finer granular level. Recently, some computational approaches have been proposed to differentiate isoform functions using sequence and expression data.
IEEE Trans Neural Netw Learn Syst
September 2022
Multiview multi-instance multilabel learning (M3L) is a framework for modeling complex objects. In this framework, each object (or bag) contains one or more instances, is represented with different feature views, and simultaneously annotated with a set of nonexclusive semantic labels. Given the multiplicity of the studied objects, traditional M3L methods generally demand a large number of labeled bags to train a predictive model to annotate bags (or instances) with semantic labels.
Hashing has been widely adopted for large-scale data retrieval in many domains due to its low storage cost and high retrieval speed. Existing cross-modal hashing methods optimistically assume that the correspondence between training samples across modalities is readily available. This assumption is unrealistic in practical applications.
The goal of zero-shot learning (ZSL) is to build a classifier that recognizes novel categories with no corresponding annotated training data. The typical routine is to transfer knowledge from seen classes to unseen ones by learning a visual-semantic embedding. Existing multi-label zero-shot learning approaches either ignore correlations among labels, suffer from large label combinations, or learn the embedding using only local or global visual features.
Controlling the quality of tertiary structures computed for a protein molecule remains a central challenge in de-novo protein structure prediction. The rule of thumb is to generate as many structures as can be afforded, effectively acknowledging that having more structures increases the likelihood that some will reside near the sought biologically-active structure. A major drawback with this approach is that computing a large number of structures imposes time and space costs.
Crowdsourcing is an economic and efficient strategy aimed at collecting annotations of data through an online platform. Crowd workers with different expertise are paid for their service, and the task requester usually has a limited budget. How to collect reliable annotations for multilabel data and how to compute the consensus within budget are interesting and challenging, but rarely studied, problems.
Motivation: Isoforms are alternatively spliced mRNAs of genes. They can be translated into different functional proteoforms, and thus greatly increase the functional diversity of protein variants (or proteoforms). Differentiating the functions of isoforms (or proteoforms) helps in understanding the underlying pathology of various complex diseases at a deeper granularity.
Discovering driver pathways is an essential step to uncover the molecular mechanism underlying cancer and to explore precise treatments for cancer patients. However, due to the difficulties of mapping genes to pathways and the limited knowledge about pathway interactions, most previous work focuses on identifying individual pathways. In practice, two (or even more) pathways interplay and often cooperatively trigger cancer.
Clustering is a fundamental data exploration task which aims at discovering the hidden grouping structure in the data. The traditional clustering methods typically compute a single partition. However, there often exist different and equally meaningful clusterings in complex data.
In multiview multilabel learning, each object is represented by several heterogeneous feature representations and is also annotated with a set of discrete nonexclusive labels. Previous studies typically focus on capturing the shared latent patterns among multiple views, while not sufficiently considering the diverse characteristics of individual views, which can cause performance degradation. In this article, we propose a novel approach [individuality- and commonality-based multiview multilabel learning (ICM2L)] to explicitly explore the individuality and commonality information of multilabel multiple view data in a unified model.
Motivation: Alternative splicing contributes to the functional diversity of protein species and the proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions. Computationally predicting the functions of genes has been studied for decades. However, how to distinguish the functional annotations of isoforms, whose annotations are essential for understanding developmental abnormalities and cancers, is rarely explored.
Accumulating evidence shows that long non-coding RNAs (lncRNAs) play important roles in various critical biological processes, and that they affect the development and progression of various human diseases. Therefore, it is necessary to precisely identify the lncRNA-disease associations. The identification precision can be improved by developing data integrative models.
Motivation: Long non-coding RNAs (lncRNAs) play crucial roles in complex disease diagnosis, prognosis, prevention and treatment, but only a small portion of lncRNA-disease associations have been experimentally verified. Various computational models have been proposed to identify lncRNA-disease associations by integrating heterogeneous data sources. However, existing models generally ignore the intrinsic structure of data sources or treat them as equally relevant, while they may not be.
Many real-world problems involve massive amounts of data. Under these circumstances learning algorithms often become prohibitively expensive, making scalability a pressing issue to be addressed. A common approach is to perform sampling to reduce the size of the dataset and enable efficient learning.
High-throughput experimental techniques provide a wide variety of heterogeneous proteomic data sources. To exploit the information spread across multiple sources for protein function prediction, these data sources are transformed into kernels and then integrated into a composite kernel. Several methods first optimize the weights on these kernels to produce a composite kernel, and then train a classifier on the composite kernel.
BMC Bioinformatics
August 2015
Background: High-throughput bio-techniques accumulate an ever-increasing amount of genomic and proteomic data. These data are far from being functionally characterized, despite the advances in functional annotation of genes (or their protein products). Due to experimental techniques and to the research bias in biology, the regularly updated functional annotation databases, i.
Background: High throughput techniques produce multiple functional association networks. Integrating these networks can enhance the accuracy of protein function prediction. Many algorithms have been introduced to generate a composite network, which is obtained as a weighted sum of individual networks.
Background: Protein function prediction aims to assign biological or biochemical functions to proteins, and it is a challenging computational problem characterized by several factors: (1) the number of function labels (annotations) is large; (2) a protein may be associated with multiple labels; (3) the function labels are structured in a hierarchy; and (4) the labels are incomplete. Current predictive models often assume that the labels of the labeled proteins are complete, i.e.
IEEE/ACM Trans Comput Biol Bioinform
March 2016
Automated protein function prediction is one of the grand challenges in computational biology. Multi-label learning is widely used to predict functions of proteins. Most multi-label learning methods make predictions for unlabeled proteins under the assumption that the labeled proteins are completely annotated, i.
IEEE/ACM Trans Comput Biol Bioinform
January 2014
IEEE/ACM Trans Comput Biol Bioinform
August 2014
High-throughput experimental techniques produce several kinds of heterogeneous proteomic and genomic data sets. To computationally annotate proteins, it is necessary and promising to integrate these heterogeneous data sources. Some methods transform these data sources into different kernels or feature representations.
The nearest neighbor technique is a simple and appealing approach to addressing classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality.