Publications by authors named "Jianhua Z Huang"

Gene knockout (KO) experiments are a proven, powerful approach for studying gene function. However, systematic KO experiments targeting a large number of genes are usually prohibitive due to the limit of experimental and animal resources. Here, we present scTenifoldKnk, an efficient virtual KO tool that enables systematic KO investigation of gene function using data from single-cell RNA sequencing (scRNA-seq).

View Article and Find Full Text PDF

We present scTenifoldNet-a machine learning workflow built upon principal-component regression, low-rank tensor approximation, and manifold alignment-for constructing and comparing single-cell gene regulatory networks (scGRNs) using data from single-cell RNA sequencing. scTenifoldNet reveals regulatory changes in gene expression between samples by comparing the constructed scGRNs. With real data, scTenifoldNet identifies specific gene expression programs associated with different biological processes, providing critical insights into the underlying mechanism of regulatory networks governing cellular transcriptional activities.

View Article and Find Full Text PDF

As single-cell RNA sequencing (scRNA-seq) data becomes widely available, cell-to-cell variability in gene expression, or (scEV), has been increasingly appreciated. However, it remains unclear whether this variability is functionally important and, if so, what are its implications for multi-cellular organisms. Here, we analyzed multiple scRNA-seq data sets from lymphoblastoid cell lines (LCLs), lung airway epithelial cells (LAECs), and dermal fibroblasts (DFs) and, for each cell type, selected a group of homogenous cells with highly similar expression profiles.

View Article and Find Full Text PDF

A semiparametric varying-coefficient mixed regressive spatial autoregressive model is used to study covariate effects on spatially dependent responses, where the effects of some covariates are allowed to vary with other variables. A semiparametric series-based least squares estimating procedure is proposed with the introduction of instrumental variables and series approximations of the conditional expectations. The estimators for both the nonparametric and parametric components of the model are shown to be consistent and their asymptotic distributions are derived.

View Article and Find Full Text PDF

In many applications, non-Gaussian data such as binary or count are observed over a continuous domain and there exists a smooth underlying structure for describing such data. We develop a new functional data method to deal with this kind of data when the data are regularly spaced on the continuous domain. Our method, referred to as Exponential Family Functional Principal Component Analysis (EFPCA), assumes the data are generated from an exponential family distribution, and the matrix of the canonical parameters has a low-rank structure.

View Article and Find Full Text PDF

Recently, the study of protein structures using angular representations has attracted much attention among structural biologists. The main challenge is how to efficiently model the continuous conformational space of the protein structures based on the differences and similarities between different Ramachandran plots. Despite the presence of statistical methods for modeling angular data of proteins, there is still a substantial need for more sophisticated and faster statistical tools to model the large-scale circular datasets.

View Article and Find Full Text PDF

This paper studies the problem of detecting the presence of nanoparticles in noisy transmission electron microscopic (TEM) images and then fitting each nanoparticle with an elliptic shape model. In order to achieve robustness while handling low contrast and high noise in the TEM images, we propose an approach to fuse two kinds of complementary image information, namely, the pixel intensity and the gradient (the first derivative in intensity). Our approach entails two main steps: 1) the first step is to, after necessary pre-processing, employ both intensity-based information and gradient-based information to process the same TEM image and produce two independent sets of results and 2) the subsequent step is to formulate a binary integer programming (BIP) problem for conflict resolution among the two sets of results.

View Article and Find Full Text PDF

We propose a Sparse exponential family Principal Component Analysis (SePCA) method suitable for any type of data following exponential family distributions, to achieve simultaneous dimension reduction and variable selection for better interpretation of the results. Because of the generality of exponential family distributions, the method can be applied to a wide range of applications, in particular when analyzing high dimensional next-generation sequencing data and genetic mutation data in genomics. The use of sparsity-inducing penalty helps produce sparse principal component loading vectors such that the principal components can focus on informative variables.

View Article and Find Full Text PDF

The characteristics of low minor allele frequency (MAF) and weak individual effects make genome-wide association studies (GWAS) for rare variant single nucleotide polymorphisms (SNPs) more difficult when using conventional statistical methods. By aggregating the rare variant effects belonging to the same gene, collapsing is the most common way to enhance the detection of rare variant effects for association analyses with a given trait. In this paper, we propose a novel framework of MAF-based logistic principal component analysis (MLPCA) to derive aggregated statistics by explicitly modeling the correlation between rare variant SNP data, which is categorical.

View Article and Find Full Text PDF

Motivated by data recording the effects of an exercise intervention on subjects' physical activity over time, we develop a model to assess the effects of a treatment when the data are functional with 3 levels (subjects, weeks and days in our application) and possibly incomplete. We develop a model with 3-level mean structure effects, all stratified by treatment and subject random effects, including a general subject effect and nested effects for the 3 levels. The mean and random structures are specified as smooth curves measured at various time points.

View Article and Find Full Text PDF

Principal component analysis (PCA) is a popular dimension reduction method to reduce the complexity and obtain the informative aspects of high-dimensional datasets. When the data distribution is skewed, data transformation is commonly used prior to applying PCA. Such transformation is usually obtained from previous studies, prior knowledge, or trial-and-error.

View Article and Find Full Text PDF

In genome-wide association studies, the primary task is to detect biomarkers in the form of Single Nucleotide Polymorphisms (SNPs) that have nontrivial associations with a disease phenotype and some other important clinical/environmental factors. However, the extremely large number of SNPs comparing to the sample size inhibits application of classical methods such as the multiple logistic regression. Currently the most commonly used approach is still to analyze one SNP at a time.

View Article and Find Full Text PDF

Background: Protein-ligand binding is important for some proteins to perform their functions. Protein-ligand binding sites are the residues of proteins that physically bind to ligands. Despite of the recent advances in computational prediction for protein-ligand binding sites, the state-of-the-art methods search for similar, known structures of the query and predict the binding sites based on the solved structures.

View Article and Find Full Text PDF

Identification of circadian-regulated genes based on temporal transcriptome data is important for studying the regulation mechanism of the circadian system. However, various computational methods adopting different strategies for the identification of cycling transcripts usually yield inconsistent results even for the same dataset, making it challenging to choose the optimal method for a specific circadian study. To address this challenge, we evaluate 5 popular methods, including ARSER (ARS), COSOPT (COS), Fisher's G test (FIS), HAYSTACK (HAY), and JTK_CYCLE (JTK), based on both simulated and empirical datasets.

View Article and Find Full Text PDF

In order to have a better understanding of unexplained heritability for complex diseases in conventional Genome-Wide Association Studies (GWAS), aggregated association analyses based on predefined functional regions, such as genes and pathways, become popular recently as they enable evaluating joint effect of multiple Single-Nucleotide Polymorphisms (SNPs), which helps increase the detection power, especially when investigating genetic variants with weak individual effects. In this paper, we focus on aggregated analysis methods based on the idea of Principal Component Analysis (PCA). The past approaches using PCA mostly make some inherent genotype data and/or risk effect model assumptions, which may hinder the accurate detection of potential disease SNPs that influence disease phenotypes.

View Article and Find Full Text PDF

In this work, we propose a spatial-temporal two-way regularized regression method for reconstructing neural source signals from EEG/MEG time course measurements. The proposed method estimates the dipole locations and amplitudes simultaneously through minimizing a single penalized least squares criterion. The novelty of our methodology is the simultaneous consideration of three desirable properties of the reconstructed source signals, that is, spatial focality, spatial smoothness, and temporal smoothness.

View Article and Find Full Text PDF

Hot spot residues of proteins are fundamental interface residues that help proteins perform their functions. Detecting hot spots by experimental methods is costly and time-consuming. Sequential and structural information has been widely used in the computational prediction of hot spots.

View Article and Find Full Text PDF

Non-convex sparsity-inducing penalties have recently received considerable attentions in sparse learning. Recent theoretical investigations have demonstrated their superiority over the convex counterparts in several sparse learning settings. However, solving the non-convex optimization problems associated with non-convex penalties remains a big challenge.

View Article and Find Full Text PDF

Ordinary differential equations (ODEs) are widely used in biomedical research and other scientific areas to model complex dynamic systems. It is an important statistical problem to estimate parameters in ODEs from noisy observations. In this article we propose a method for estimating the time-varying coefficients in an ODE.

View Article and Find Full Text PDF

Despite considerable progress in the past decades, protein structure prediction remains one of the major unsolved problems in computational biology. Angular-sampling-based methods have been extensively studied recently due to their ability to capture the continuous conformational space of protein structures. The literature has focused on using a variety of parametric models of the sequential dependencies between angle pairs along the protein chains.

View Article and Find Full Text PDF

We consider the problem of estimation in semiparametric varying coefficient models where the covariate modifying the varying coefficients is functional and is modeled nonparametrically. We develop a kernel-based estimator of the nonparametric component and a profiling estimator of the parametric component of the model and derive their asymptotic properties. Specifically, we show the consistency of the nonparametric functional estimates and derive the asymptotic expansion of the estimates of the parametric component.

View Article and Find Full Text PDF

This paper presents a method that enables automated morphology analysis of partially overlapping nanoparticles in electron micrographs. In the undertaking of morphology analysis, three tasks appear necessary: separate individual particles from an agglomerate of overlapping nano-objects; infer the particle's missing contours; and ultimately, classify the particles by shape based on their complete contours. Our specific method adopts a two-stage approach: the first stage executes the task of particle separation, and the second stage conducts simultaneously the tasks of contour inference and shape classification.

View Article and Find Full Text PDF

Gene expression index estimation is an essential step in analyzing multiple probe microarray data. Various modeling methods have been proposed in this area. Amidst all, a popular method proposed in Li and Wong (2001) is based on a multiplicative model, which is similar to the additive model discussed in Irizarry et al.

View Article and Find Full Text PDF

We develop a new principal components analysis (PCA) type dimension reduction method for binary data. Different from the standard PCA which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced to the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components.

View Article and Find Full Text PDF

Hierarchical functional data are widely seen in complex studies where sub-units are nested within units, which in turn are nested within treatment groups. We propose a general framework of functional mixed effects model for such data: within unit and within sub-unit variations are modeled through two separate sets of principal components; the sub-unit level functions are allowed to be correlated. Penalized splines are used to model both the mean functions and the principal components functions, where roughness penalties are used to regularize the spline fit.

View Article and Find Full Text PDF