Publications by authors named "Irwin King"

Accurate prediction of RNA three-dimensional (3D) structures remains an unsolved challenge. Determining RNA 3D structures is crucial for understanding their functions and informing RNA-targeting drug development and synthetic biology design. The structural flexibility of RNA, which leads to the scarcity of experimentally determined data, complicates computational prediction efforts.

View Article and Find Full Text PDF

RNAs represent a class of programmable biomolecules capable of performing diverse biological functions. Recent studies have developed accurate RNA three-dimensional structure prediction methods, which may enable new RNAs to be designed in a structure-guided manner. Here, we develop a structure-to-sequence deep learning platform for the de novo generative design of RNA aptamers.

View Article and Find Full Text PDF
Article Synopsis
  • Bioinformatics is experiencing significant advancements through the use of foundation models (FMs) in AI, which help address challenges like limited annotated data and data noise, marking a crucial shift in computational biology.
  • The survey aims to summarize the evolution, current research landscape, and methodologies of FMs in bioinformatics, with a focus on their applications to specific biological problems such as sequence analysis and structure prediction.
  • The review discusses challenges faced by FMs, such as data noise and interpretability, while providing insights and potential directions for future developments in the application of FMs within biological research.
View Article and Find Full Text PDF

Time series are the primary data type used to record dynamic system measurements and generated in great volume by both physical sensors and online processes (virtual sensors). Time series analytics is therefore crucial to unlocking the wealth of information implicit in available data. With the recent advancements in graph neural networks (GNNs), there has been a surge in GNN-based approaches for time series analysis.

View Article and Find Full Text PDF

The identification of protein homologs in large databases using conventional methods, such as protein sequence comparison, often misses remote homologs. Here, we offer an ultrafast, highly sensitive method, dense homolog retriever (DHR), for detecting homologs on the basis of a protein language model and dense retrieval techniques. Its dual-encoder architecture generates different embeddings for the same protein sequence and easily locates homologs by comparing these representations.

View Article and Find Full Text PDF

Single-cell RNA sequencing has achieved massive success in biological research fields. Discovering novel cell types from single-cell transcriptomics has been demonstrated to be essential in the field of biomedicine, yet is time-consuming and needs prior knowledge. With the unprecedented boom in cell atlases, auto-annotation tools have become more prevalent due to their speed, accuracy and user-friendly features.

View Article and Find Full Text PDF

Heterogeneous graph neural networks (HGNNs) were proposed for representation learning on structural data with multiple types of nodes and edges. To deal with the performance degradation issue when HGNNs become deep, researchers combine metapaths into HGNNs to associate nodes closely related in semantics but far apart in the graph. However, existing metapath-based models suffer from either information loss or high computation costs.

View Article and Find Full Text PDF
Article Synopsis
  • - The study investigated the connection between frailty, dysglycaemia, and mortality in older adults (70+) with type 2 diabetes on insulin therapy using continuous glucose monitors (CGMs), revealing a link between greater frailty and hypoglycaemia.
  • - Conducted in Hong Kong, the HARE study included 225 participants (with 215 providing usable data) and examined factors like age, glycated hemoglobin (HbA), and renal function, finding that frailty levels increased with poorer health metrics.
  • - Results showed that 8 out of 11 CGM metrics were significantly tied to frailty, particularly noting that decreased time in range (TIR) and increased time above range (TAR) were
View Article and Find Full Text PDF

We present a novel self-supervised Contrastive LEArning framework for single-cell ribonucleic acid (RNA)-sequencing (CLEAR) data representation and the downstream analysis. Compared with current methods, CLEAR overcomes the heterogeneity of the experimental data with a specifically designed representation learning task and thus can handle batch effects and dropout events simultaneously. It achieves superior performance on a broad range of fundamental tasks, including clustering, visualization, dropout correction, batch effect removal, and pseudo-time inference.

View Article and Find Full Text PDF

Single cell RNA sequencing (scRNA-seq) allows researchers to explore tissue heterogeneity, distinguish unusual cell identities, and find novel cellular subtypes by providing transcriptome profiling for individual cells. Clustering analysis is usually used to predict cell class assignments and infer cell identities. However, the performance of existing single-cell clustering methods is extremely sensitive to the presence of noise data and outliers.

View Article and Find Full Text PDF

Semi-supervised learning (SSL) has tremendous value in practice due to the utilization of both labeled and unlabelled data. An essential class of SSL methods, referred to as graph-based semi-supervised learning (GSSL) methods in the literature, is to first represent each sample as a node in an affinity graph, and then, the label information of unlabeled samples can be inferred based on the structure of the constructed graph. GSSL methods have demonstrated their advantages in various domains due to their uniqueness of structure, the universality of applications, and their scalability to large-scale data.

View Article and Find Full Text PDF

We present DistillFlow, a knowledge distillation approach to learning optical flow. DistillFlow trains multiple teacher models and a student model, where challenging transformations are applied to the input of the student model to generate hallucinated occlusions as well as less confident predictions. Then, a self-supervised learning framework is constructed: confident predictions from teacher models are served as annotations to guide the student model to learn optical flow for those less confident predictions.

View Article and Find Full Text PDF

Estimating covariance matrix from massive high-dimensional and distributed data is significant for various real-world applications. In this paper, we propose a data-aware weighted sampling-based covariance matrix estimator, namely DACE, which can provide an unbiased covariance matrix estimation and attain more accurate estimation under the same compression ratio. Moreover, we extend our proposed DACE to tackle multiclass classification problems with theoretical justification and conduct extensive experiments on both synthetic and real-world data sets to demonstrate the superior performance of our DACE.

View Article and Find Full Text PDF

Classifying binary imbalanced streaming data is a significant task in both machine learning and data mining. Previously, online area under the receiver operating characteristic (ROC) curve (AUC) maximization has been proposed to seek a linear classifier. However, it is not well suited for handling nonlinearity and heterogeneity of the data.

View Article and Find Full Text PDF

Feature selection is an important problem in machine learning and data mining. We consider the problem of selecting features under the budget constraint on the feature subset size. Traditional feature selection methods suffer from the "monotonic" property.

View Article and Find Full Text PDF

Semi-supervised learning (SSL) is a typical learning paradigms training a model from both labeled and unlabeled data. The traditional SSL models usually assume unlabeled data are relevant to the labeled data, i.e.

View Article and Find Full Text PDF

Expertise retrieval, whose task is to suggest people with relevant expertise on the topic of interest, has received increasing interest in recent years. One of the issues is that previous algorithms mainly consider the documents associated with the experts while ignoring the community information that is affiliated with the documents and the experts. Motivated by the observation that communities could provide valuable insight and distinctive information, we investigate and develop two community-aware strategies to enhance expertise retrieval.

View Article and Find Full Text PDF

Kernel methods have been successfully applied in various applications. To succeed in these applications, it is crucial to learn a good kernel representation, whose objective is to reveal the data similarity precisely. In this paper, we address the problem of multiple kernel learning (MKL), searching for the optimal kernel combination weights through maximizing a generalized performance measure.

View Article and Find Full Text PDF

Feature selection has attracted a huge amount of interest in both research and application communities of data mining. We consider the problem of semi-supervised feature selection, where we are given a small amount of labeled examples and a large amount of unlabeled examples. Since a small number of labeled samples are usually insufficient for identifying the relevant features, the critical problem arising from semi-supervised feature selection is how to take advantage of the information underneath the unlabeled data.

View Article and Find Full Text PDF

Support vector machines (SVM) are state-of-the-art classifiers. Typically L2-norm or L1-norm is adopted as a regularization term in SVMs, while other norm-based SVMs, for example, the L0-norm SVM or even the L(infinity)-norm SVM, are rarely seen in the literature. The major reason is that L0-norm describes a discontinuous and nonconvex term, leading to a combinatorially NP-hard optimization problem.

View Article and Find Full Text PDF

Kernel methods have been widely used in pattern recognition. Many kernel classifiers such as Support Vector Machines (SVM) assume that data can be separated by a hyperplane in the kernel-induced feature space. These methods do not consider the data distribution and are difficult to output the probabilities or confidences for classification.

View Article and Find Full Text PDF

Heat-diffusion models have been successfully applied to various domains such as classification and dimensionality-reduction tasks in manifold learning. One critical local approximation technique is employed to weigh the edges in the graph constructed from data points. This approximation technique is based on an implicit assumption that the data are distributed evenly.

View Article and Find Full Text PDF

The Biased Minimax Probability Machine (BMPM) constructs a classifier which deals with the imbalanced learning tasks. It provides a worst-case bound on the probability of misclassification of future data points based on reliable estimates of means and covariance matrices of the classes from the training data samples, and achieves promising performance. In this paper, we develop a novel yet critical extension training algorithm for BMPM that is based on Second-Order Cone Programming (SOCP).

View Article and Find Full Text PDF

Imbalanced learning is a challenged task in machine learning. In this context, the data associated with one class are far fewer than those associated with the other class. Traditional machine learning methods seeking classification accuracy over a full range of instances are not suitable to deal with this problem, since they tend to classify all the data into a majority class, usually the less important class.

View Article and Find Full Text PDF

The challenging task of medical diagnosis based on machine learning techniques requires an inherent bias, i.e., the diagnosis should favor the "ill" class over the "healthy" class, since misdiagnosing a patient as a healthy person may delay the therapy and aggravate the illness.

View Article and Find Full Text PDF