As modern data-collection technologies advance, it has become increasingly common for the number of collected confounders to exceed the number of subjects in a data set. However, matching-based methods for estimating causal treatment effects cannot, in their original forms, handle high-dimensional confounders, and their various modified versions lack statistical support and valid inference tools. In this article, we propose a new approach for estimating the causal treatment effect, defined as the difference in restricted mean survival time (RMST) under different treatments, in a high-dimensional setting for survival data.
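To make the estimand concrete, below is a minimal sketch that computes a nonparametric RMST difference from Kaplan-Meier fits using the lifelines package. It illustrates the RMST contrast only, not the paper's high-dimensional confounder adjustment; the simulated data and all variable names are hypothetical.

```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.utils import restricted_mean_survival_time

rng = np.random.default_rng(0)
n = 200
treated = rng.integers(0, 2, n)                      # hypothetical treatment arm
t_true = rng.exponential(np.where(treated == 1, 12.0, 8.0))
c = rng.uniform(0, 30, n)                            # censoring times
time, event = np.minimum(t_true, c), t_true <= c

tau = 10.0                                           # restriction time
rmst = {}
for arm in (0, 1):
    km = KaplanMeierFitter().fit(time[treated == arm], event[treated == arm])
    rmst[arm] = restricted_mean_survival_time(km, t=tau)

print("RMST difference (treated - control):", rmst[1] - rmst[0])
```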
Accurate assessment of the mean-variance relation can benefit subsequent analysis in biomedical research. However, in most biomedical data, both the true mean and the true variance are unavailable. Instead, the sample mean and sample variance formed from raw data are typically used in practice.
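As a concrete illustration of that common practice, the sketch below forms per-gene sample means and variances from raw counts and fits a naive log-log regression of variance on mean; the data and dimensions are simulated assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
G, n = 500, 10                       # genes x replicates (hypothetical sizes)
mu = rng.gamma(2.0, 50.0, G)
data = rng.poisson(mu, size=(n, G))  # Poisson => true variance equals mean

m = data.mean(axis=0)                # sample means
v = data.var(axis=0, ddof=1)         # sample variances
keep = (m > 0) & (v > 0)

# Fit log v = a + b log m; b near 1 suggests a Poisson-like relation.
b, a = np.polyfit(np.log(m[keep]), np.log(v[keep]), 1)
print(f"estimated power b = {b:.2f} (true relation: variance = mean)")
```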
We study estimation and testing in the Poisson regression model with noisy high-dimensional covariates, which has wide applications in analyzing noisy big data. Correcting for the estimation bias due to the covariate noise leads to a non-convex target function to minimize. To handle the high-dimensional issue, we further augment the target function with an amenable penalty term.
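A hedged sketch of the general idea: for Poisson regression with additive normal covariate noise, a Nakamura-style corrected score removes the attenuation bias, and an L1 penalty (used here as a stand-in for the paper's amenable penalty) is handled by proximal gradient steps. The noise covariance Sigma is assumed known, and this is an illustration rather than the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 20
beta_true = np.zeros(p); beta_true[:3] = [0.5, -0.4, 0.3]
X = rng.normal(size=(n, p))
y = rng.poisson(np.exp(X @ beta_true))
Sigma = 0.1 * np.eye(p)                        # assumed known noise covariance
W = X + rng.multivariate_normal(np.zeros(p), Sigma, size=n)

def corrected_grad(beta):
    # corrected loss per subject: exp(w'b - b'Sigma b / 2) - y * w'b,
    # whose expectation matches the noise-free Poisson loss
    mu = np.exp(W @ beta - 0.5 * beta @ Sigma @ beta)
    return (W.T @ mu - (Sigma @ beta) * mu.sum() - W.T @ y) / n

def soft(z, t):                                # L1 proximal operator
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

beta, lam, step = np.zeros(p), 0.1, 0.05
for _ in range(2000):                          # proximal gradient iterations
    beta = soft(beta - step * corrected_grad(beta), step * lam)

print("indices of nonzero estimates:", np.nonzero(np.abs(beta) > 1e-6)[0])
```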
Expression quantitative trait loci (eQTL) studies utilize regression models to explain the variance of gene expressions with genetic loci or single nucleotide polymorphisms (SNPs). However, regression models for eQTL are challenged by the presence of high-dimensional, non-sparse, and correlated SNPs with small effects, and by nonlinear relationships between responses and SNPs. Principal component analyses are commonly conducted for dimension reduction without considering the responses.
Identifying a patient's disease/health status from electronic medical records is a frequently encountered task in electronic health record (EHR) research, and estimating a classification model often requires benchmark training data with patients' known phenotype statuses. However, assessing a patient's phenotype is costly and labor-intensive, so careful selection of EHR records for the training set is desirable. We propose a procedure to tailor the best training subsample of limited size for a classification model, minimizing its mean-squared phenotyping/classification error (MSE).
We consider analyses of case-control studies assembled from electronic health records (EHRs) where the pool of cases is contaminated by patients who are ineligible for the study. These ineligible patients, referred to as "false cases," should be excluded from the analyses if known. However, the true outcome status of a patient in the case pool is unknown except in a subset whose size may be arbitrarily small compared to the entire pool.
Large clinical datasets derived from insurance claims and electronic health record (EHR) systems are valuable sources for precision medicine research. These datasets can be used to develop models for personalized prediction of risk or treatment response. Efficiently deriving prediction models from real-world data, however, faces practical and methodological challenges.
We propose a two-way additive model with group-specific interactions, where the group information is unknown. We treat the group membership as latent information and propose an EM algorithm for estimation. With a single observation matrix and with diverging row and column numbers, we rigorously establish the estimation consistency and asymptotic normality of our estimator.
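To show the flavor of such an EM iteration, the toy sketch below treats row-group membership as latent and alternates posterior responsibilities (E-step) with group-specific parameter updates (M-step). It simplifies the paper's two-way additive model to a Gaussian mixture over rows, absorbing the additive effects into group-specific column means; all settings are hypothetical.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(3)
n, m, K = 120, 30, 2                     # rows, columns, latent groups
z_true = rng.integers(0, K, n)
centers = rng.normal(size=(K, m))
Y = centers[z_true] + 0.5 * rng.normal(size=(n, m))

R = rng.dirichlet(np.ones(K), size=n)    # initial responsibilities
for _ in range(100):
    # M-step: group proportions, column means per group, shared variance
    pi = R.mean(axis=0)
    M = (R.T @ Y) / R.sum(axis=0)[:, None]
    resid2 = np.array([((Y - M[k]) ** 2).sum(axis=1) for k in range(K)]).T
    sigma2 = (R * resid2).sum() / (n * m)
    # E-step: posterior responsibilities from Gaussian log-likelihoods
    logp = np.log(pi) - 0.5 * resid2 / sigma2 - 0.5 * m * np.log(2 * np.pi * sigma2)
    R = np.exp(logp - logsumexp(logp, axis=1, keepdims=True))

print("estimated group sizes:", R.sum(axis=0))
```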
When estimating the treatment effect in an observational study, we use a semiparametric locally efficient dimension-reduction approach to assess both the treatment assignment mechanism and the average responses in the treated and non-treated groups. We then integrate all results through imputation, inverse probability weighting, and doubly robust augmentation estimators. Doubly robust estimators are locally efficient, while imputation estimators are super-efficient when the response models are correct.
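The compact sketch below contrasts the three estimator families named above, imputation, inverse probability weighting, and the doubly robust (AIPW) augmentation, using plain parametric fits in place of the paper's semiparametric dimension-reduction models; the data-generating setup is a simulated assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(4)
n = 2000
X = rng.normal(size=(n, 3))
T = rng.uniform(size=n) < 1 / (1 + np.exp(-X[:, 0]))        # treatment
Y = X @ [1.0, 0.5, -0.5] + 2.0 * T + rng.normal(size=n)     # true effect = 2

e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]   # propensity score
m1 = LinearRegression().fit(X[T], Y[T]).predict(X)          # E[Y | X, T=1]
m0 = LinearRegression().fit(X[~T], Y[~T]).predict(X)        # E[Y | X, T=0]

imp = np.mean(m1 - m0)                                      # imputation
ipw = np.mean(T * Y / e - (~T) * Y / (1 - e))               # inverse prob. weighting
aipw = np.mean(m1 - m0 + T * (Y - m1) / e - (~T) * (Y - m0) / (1 - e))
print(f"imputation={imp:.2f}  IPW={ipw:.2f}  AIPW={aipw:.2f}  (truth=2)")
```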
In this paper, we develop a model averaging method to estimate a high-dimensional covariance matrix, where the candidate models are constructed by different orders of polynomial functions. We propose a Mallows-type model averaging criterion and select the weights by minimizing this criterion, which is an unbiased estimator of the expected in-sample squared error plus a constant. Then, we prove the asymptotic optimality of the resulting model average covariance estimators.
Validation of phenotyping models using electronic health record (EHR) data conventionally requires gold-standard case and control labels. The labeling process requires clinical experts to retrospectively review patients' medical charts and is therefore labor-intensive and time-consuming. For some disease conditions, it is prohibitive to identify gold-standard controls because routine clinical assessments are performed only for selected patients who are deemed to possibly have the condition.
J R Stat Soc Series B Stat Methodol, September 2019
We develop model averaging estimation in the linear regression model where some covariates are subject to measurement error. The absence of the true covariates in this framework makes the calculation of the standard residual-based loss function impossible. We take advantage of the explicit form of the parameter estimators and construct a weight choice criterion.
Comput Stat Data Anal, April 2020
Field studies in ecology often make use of data collected in a hierarchical fashion, and may combine studies that vary in sampling design. For example, studies of tree recruitment after disturbance may use counts of individual seedlings from plots that vary in spatial arrangement and sampling density. To account for the multi-level design and the fact that more than a few plots usually yield no individuals, a mixed-effects zero-inflated Poisson model is often adopted.
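For illustration, the sketch below fits a plain zero-inflated Poisson model with statsmodels. It omits the random plot effects the abstract calls for (statsmodels does not provide a mixed-effects ZIP), so it demonstrates only the zero-inflation component, on hypothetical data.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedPoisson

rng = np.random.default_rng(5)
n = 400
density = rng.normal(size=n)                   # hypothetical plot covariate
lam = np.exp(0.3 + 0.6 * density)              # Poisson mean for seedling counts
occupied = rng.uniform(size=n) < 0.7           # 30% structural zeros
counts = np.where(occupied, rng.poisson(lam), 0)

X = sm.add_constant(density)
fit = ZeroInflatedPoisson(counts, X, exog_infl=np.ones((n, 1))).fit(disp=0)
print(fit.params)   # inflation logit intercept, then Poisson coefficients
```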
High-dimensional gene expression data often exhibit intricate correlation patterns as a result of coordinated genetic regulation. In practice, however, it is difficult to measure these coordinated underlying activities directly. Analysis of breast cancer survival data with gene expressions motivates us to use a two-stage latent factor approach to estimate these unobserved coordinated biological processes.
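A minimal two-stage sketch in this spirit: estimate latent factor scores from the expression matrix, then regress survival on the scores with a Cox model. Off-the-shelf FactorAnalysis and CoxPHFitter are stand-ins for the paper's estimators, and every dimension and name is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from lifelines import CoxPHFitter

rng = np.random.default_rng(6)
n, genes, k = 300, 100, 3
scores_true = rng.normal(size=(n, k))               # latent biological processes
loadings = rng.normal(size=(k, genes))
expr = scores_true @ loadings + rng.normal(size=(n, genes))

# Stage 1: recover factor scores from the expression matrix
scores = FactorAnalysis(n_components=k, random_state=0).fit_transform(expr)

# Stage 2: Cox regression of survival time on the estimated scores
t = rng.exponential(np.exp(-0.5 * scores_true[:, 0]))
df = pd.DataFrame(scores, columns=[f"f{j}" for j in range(k)])
df["time"], df["event"] = t, 1
print(CoxPHFitter().fit(df, "time", "event").summary[["coef", "p"]])
```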
Objective: Phenotyping patients using electronic health record (EHR) data conventionally requires labeled cases and controls. Assigning labels requires manual medical chart review and is therefore labor-intensive. For some phenotypes, identifying gold-standard controls is prohibitive.
Case-control studies are popular epidemiological designs for detecting gene-environment interactions in the etiology of complex diseases, where the genetic susceptibility and environmental exposures may often be reasonably assumed independent in the source population. Various papers have presented analytical methods exploiting gene-environment independence to achieve better efficiency, all of which require either a rare-disease assumption or a distributional assumption on the genetic variables. We relax both assumptions.
We propose a consistent and locally efficient estimator of the model parameters in a logistic mixed-effects model with random slopes. Our approach relaxes two typical assumptions: that the random effects are normally distributed, and that the covariates and random effects are independent of each other. Adhering to these assumptions is particularly difficult in health studies, where we often have limited resources to design experiments and gather data in long-term studies, while new findings from other fields may emerge, suggesting that such assumptions are violated.
Covariate measurement error is a common problem. Improper treatment of measurement errors may affect the quality of estimation and the accuracy of inference. Extensive literature exists on homoscedastic measurement error models, but little research exists on heteroscedastic measurement errors.
Tang et al. (2003) considered a regression model with missing response, where the missingness mechanism depends on the value of the response variable and hence is nonignorable. They proposed three pseudolikelihood estimators, based on different treatments of the probability distribution of the completely observed covariates.
J R Stat Soc Series B Stat Methodol, September 2018
Analysing secondary outcomes is a common practice for case-control studies. Traditional secondary analysis employs either completely parametric models or conditional mean regression models to link the secondary outcome to covariates. In many situations, quantile regression models complement mean-based analyses and provide alternative insights into the associations of interest.
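As a small illustration of how quantile regression complements a mean model, the sketch below fits several quantile levels to a heteroscedastic outcome with statsmodels; it ignores the case-control sampling the paper addresses and shows only the quantreg fit itself, on hypothetical data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 1000
x = rng.uniform(0, 2, n)
y = 1.0 + 0.5 * x + (0.2 + 0.4 * x) * rng.normal(size=n)  # heteroscedastic noise
df = pd.DataFrame({"x": x, "y": y})

# Mean regression gives one slope; quantile slopes differ across tau,
# revealing the association's dependence on the outcome level.
for q in (0.25, 0.5, 0.75):
    res = smf.quantreg("y ~ x", df).fit(q=q)
    print(f"tau={q}: slope={res.params['x']:.2f}")
```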
Studying the relationship between covariates based on retrospective data is the main purpose of secondary analysis, an area of increasing interest. We examine the secondary analysis problem when multiple covariates are available, while only a regression mean model is specified. Despite the completely parametric modeling of the regression mean function, the case-control nature of the data requires special treatment, and semiparametric efficient estimation generates various nonparametric estimation problems with multivariate covariates.
Biometrics, September 2018
The problem of estimating the average treatment effect is important when evaluating the effectiveness of medical treatments or social intervention policies. Most of the existing methods for estimating the average treatment effect rely on parametric assumptions about the propensity score model or the outcome regression model in one way or another. In reality, both models are prone to misspecification, which can have undue influence on the estimated average treatment effect.
Many methods have recently been proposed for efficient analysis of case-control studies of gene-environment interactions using a retrospective likelihood framework that exploits the natural assumption of gene-environment independence in the underlying population. However, for polygenic modelling of gene-environment interactions, a topic of increasing scientific interest, applications of retrospective methods have been limited by the literature's requirement of parametric modelling of the distribution of the genetic factors. We propose a general, computationally simple, semiparametric method for the analysis of case-control studies that exploits the assumption of gene-environment independence without any further parametric modelling assumptions about the marginal distributions of either set of factors.
Prediction precision is arguably the most relevant criterion of a model in practice and is often a sought-after property. A common difficulty with covariates measured with error is that prediction evaluation cannot be performed on the data even when a model is completely specified without any unknown parameters. We bypass this inherent difficulty by using special properties of moment relations in linear regression models with measurement errors.
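The moment relation at the heart of this idea can be shown in a few lines: with observed W = X + U and noise covariance Sigma_u, E[W'W/n] = E[X'X/n] + Sigma_u, so subtracting Sigma_u de-biases the naive normal equations. This is an illustration under an assumed known Sigma_u, not the paper's full prediction-evaluation procedure.

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 5000, 3
beta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)
Sigma_u = 0.3 * np.eye(p)                      # assumed known noise covariance
W = X + rng.multivariate_normal(np.zeros(p), Sigma_u, size=n)

naive = np.linalg.solve(W.T @ W / n, W.T @ y / n)              # attenuated
corrected = np.linalg.solve(W.T @ W / n - Sigma_u, W.T @ y / n)
print("naive:", naive.round(2), " corrected:", corrected.round(2))
```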
An important goal in clinical and statistical research is properly modeling the distribution of clustered failure times, which have a natural intraclass dependency and are subject to censoring. We handle these challenges with a novel approach that does not impose restrictive modeling or distributional assumptions. Using a logit transformation, we relate the distribution for clustered failure times to covariates and a random, subject-specific effect.