Many mass spectrometry-based studies, as well as other biological experiments produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size which led to poorer classification and variable selection accuracy. Perhaps most importantly our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. Source code and stand-alone compiled versions of command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux as well as a user manual (Supplementary File S2) are available for download at: http://sourceforge.org/projects/rfpp/ under the GNU public license.
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2739274 | PMC |
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0007087 | PLOS |
Stat Methods Med Res
July 2024
Department of Biostatistics, University of Florida, Gainesville, FL, USA.
In many cluster-correlated data analyses, informative cluster size poses a challenge that can potentially introduce bias in statistical analyses. Different methodologies have been introduced in statistical literature to address this bias. In this study, we consider a complex form of informativeness where the number of observations corresponding to latent levels of a unit-level continuous covariate within a cluster is associated with the response variable.
View Article and Find Full Text PDFSci Rep
December 2022
Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA.
Cluster-correlated data receives a lot of attention in biomedical and longitudinal studies and it is of interest to assess the generalized dependence between two multivariate variables under the cluster-correlated structure. The Hilbert-Schmidt independence criterion (HSIC) is a powerful kernel-based test statistic that captures various dependence between two random vectors and can be applied to an arbitrary non-Euclidean domain. However, the existing HSIC is not directly applicable to cluster-correlated data.
View Article and Find Full Text PDFAnn Appl Stat
September 2022
Department of Biostatistics, Harvard T.H. Chan School of Public Health.
Although not without controversy, readmission is entrenched as a hospital quality metric with statistical analyses generally based on fitting a logistic-Normal generalized linear mixed model. Such analyses, however, ignore death as a competing risk, although doing so for clinical conditions with high mortality can have profound effects; a hospital's seemingly good performance for readmission may be an artifact of it having poor performance for mortality. in this paper we propose novel multivariate hospital-level performance measures for readmission and mortality that derive from framing the analysis as one of cluster-correlated semi-competing risks data.
View Article and Find Full Text PDFHealth Soc Care Community
November 2022
Institute of Gerontology, Aging Research Network-Jönköping (ARN-J), School of Health and Welfare, Jönköping University, Jönköping, Sweden.
Older persons in Sweden are increasingly encouraged to continue living at home and, if necessary, be supported by home care services (HCS). Studies have examined whether the work environment of staff has an impact on the experiences and well-being of older persons in residential care facilities, but few have examined such associations in HCS. This study examined associations between home care staff's perceptions of their psychosocial work environment and satisfaction with care among older people receiving HCS.
View Article and Find Full Text PDFStat Methods Med Res
December 2022
Department of Global Health and Social Medicine, 1857Harvard Medical School, Boston, MA, USA.
In clinical and public health studies, it is often the case that some variables relevant to the analysis are too difficult or costly to measure for all individuals in the population of interest. Rather, a subsample of these individuals must be identified for additional data collection. A sampling scheme that incorporates readily-available information for the entire target population at the design stage can increase the statistical efficiency of the intended analysis.
View Article and Find Full Text PDFEnter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!