Medical datasets may be imbalanced and contain errors due to subjective test results and clinical variability. The poor quality of original data affects classification accuracy and reliability. Hence, detecting abnormal samples in the dataset can help clinicians make better decisions.
In machine learning (ML), association patterns in the data, paths in decision trees, and weights between layers of the neural network are often entangled due to multiple underlying causes, thus masking the pattern-to-source relation, weakening prediction, and defying explanation. This paper presents a revolutionary ML paradigm: pattern discovery and disentanglement (PDD) that disentangles associations and provides an all-in-one knowledge system capable of (a) disentangling patterns to associate with distinct primary sources; (b) discovering rare/imbalanced groups, detecting anomalies and rectifying discrepancies to improve class association, pattern and entity clustering; and (c) organizing knowledge for statistically supported interpretability for causal exploration. Results from case studies have validated such capabilities.
Machine Learning has made impressive advances in many applications akin to human cognition for discernment. However, success has been limited in the areas of relational datasets, particularly for data with low volume, imbalanced groups, and mislabeled cases, with outputs that typically lack transparency and interpretability. The difficulties arise from the subtle overlapping and entanglement of functional and statistical relations at the source level.
BMC Med Inform Decis Mak
January 2021
Background: Statistical data analysis, especially advanced machine learning (ML) methods, has attracted considerable interest in clinical practice. We are looking for interpretability of the diagnostic/prognostic results that will bring confidence to doctors, patients and their relatives in therapeutics and clinical practice. When datasets are imbalanced in diagnostic categories, we notice that ordinary ML methods might produce results dominated by the majority classes, diminishing prediction accuracy.
Background: A protein family has similar and diverse functions locally conserved. An aligned pattern cluster (APC) can reflect the conserved functionality. Discovering aligned residue associations (ARAs) in APCs can reveal subtle inner working characteristics of conserved regions of protein families.
Residue-residue close contact (R2R-C) data procured from three-dimensional protein-protein interaction (PPI) experiments is currently used for predicting residue-residue interaction (R2R-I) in PPI. However, due to complex physicochemical environments, R2R-I incidences, facilitated by multiple factors, are usually entangled in the source environment and masked in the acquired data. Here we present a novel method, P2K (Pattern to Knowledge), to disentangle R2R-I patterns and render much more succinct discriminative information expressed in different specific R2R-I statistical/functional spaces.
IEEE Trans Nanobioscience
July 2018
Functional region identification is of fundamental importance for protein sequence analysis. Such knowledge provides better scientific understanding and could assist drug discovery. To date, domain annotation is one approach, but it needs to leverage existing databases.
A protein family has similar and diverse functions locally conserved as aligned sequence segments. Further discovering their association patterns could reveal subtle family subgroup characteristics. Since aligned residue associations (ARAs) in Aligned Pattern Clusters (APCs) are complex and intertwined due to entangled function, factors, and variance in the source environment, we have recently developed a novel method, Aligned Residue Association Discovery and Disentanglement (ARADD), to solve this problem.
Predicting Protein-Protein Interaction (PPI) is important for making new discoveries in the molecular mechanisms inside a cell. Traditionally, new PPIs are identified through biochemical experiments but such methods are labor-intensive, expensive, time-consuming and technically ineffective due to high false positive rates. Sequence-based prediction is currently the most readily applicable and cost-effective method.
Motivation: Evolutionarily conserved amino acids within proteins characterize functional or structural regions. Conversely, less conserved amino acids within these regions are generally areas of evolutionary divergence. A priori knowledge of biological function and species can help interpret the amino acid differences between sequences.
IEEE/ACM Trans Comput Biol Bioinform
October 2017
Understanding binding cores is of fundamental importance in deciphering Protein-DNA (TF-TFBS) binding and gene regulation. Limited by expensive experiments, it is promising to discover them with variations directly from sequence data. Although existing computational methods have produced satisfactory results, they are one-to-one mappings with no site-specific information on residue/nucleotide variations, where these variations in binding cores may impact binding specificity.
Background: The large influx of biological sequences underscores the importance of identifying and correlating conserved regions in homologous sequences to acquire valuable biological knowledge. These conserved regions contain statistically significant residue associations as sequence patterns. Thus, patterns from two conserved regions co-occurring frequently on the same sequences are inferred to have joint functionality.
IEEE/ACM Trans Comput Biol Bioinform
March 2016
Discovering sequence patterns with variations unveils significant functions of a protein family. Existing combinatorial methods of discovering patterns with variations are computationally expensive, and probabilistic methods require a more elaborate probabilistic representation of the amino acid associations. To overcome these shortcomings, this paper presents a new computationally efficient method for representing patterns with variations in a compact form called Aligned Pattern Cluster (AP Cluster).
Background: Discovering patterns from gene expression levels is regarded as a classification problem when tissue classes of the samples are given, and is solved as a discrete-data problem by discretizing the expression levels of each gene into intervals that maximize the interdependence between that gene and the class labels. However, when class information is unavailable, discovering gene expression patterns becomes difficult.
Methods: For a gene pool with a large number of genes, we first cluster the genes into smaller groups.
J Bioinform Comput Biol
October 2010
Comparative genomics is concerned with the study of genome structure and function of different species. It can provide useful information for the derivation of evolutionary and functional relationships between genomes. Previous work on genome comparison focuses mainly on comparing the entire genomes for visualization without further analysis.
This paper reports the discovery of statistically significant association patterns of gene expression levels from microarray data. By association patterns, we mean certain gene expression intensity intervals having statistically significant associations among themselves and with the tissue classes, such as cancerous and normal tissues. We describe how the significance of the associations among gene expression levels can be evaluated using a statistical measure in an objective manner.
This correspondence presents a two-stage classification learning algorithm. The first stage approximates the class-conditional distribution of a discrete space using a separate mixture model, and the second stage investigates the class posterior probabilities by training a network. The first stage explores the generative information that is inherent in each class by using the Chow-Liu (CL) method, which approximates high-dimensional probability with a tree structure, namely, a dependence tree, whereas the second stage concentrates on discriminative learning to distinguish between classes.
IEEE/ACM Trans Comput Biol Bioinform
November 2006
This paper presents an attribute clustering method which is able to group genes based on their interdependence so as to mine meaningful patterns from the gene expression data. It can be used for gene grouping, selection, and classification. The partitioning of a relational table into attribute subgroups allows a small number of attributes within or across the groups to be selected for analysis.