We develop a general statistical framework for the analysis and inference of large tree-structured data, with a focus on asymptotic goodness-of-fit tests. We first propose a consistent statistical model for binary trees, from which we develop a class of invariant tests. Using the model for binary trees, we then construct tests for general trees by exploiting the distributional properties of the Continuum Random Tree, which arises as the invariant limit for a broad class of models for tree-structured data based on conditioned Galton-Watson processes. The test statistics for the goodness-of-fit tests are simple to compute and are asymptotically distributed as χ² and F random variables. We illustrate our methods on an important application: detecting tumor heterogeneity in brain cancer. We use a novel approach based on tree-based representations of magnetic resonance images and employ the developed tests to ascertain tumor heterogeneity between two groups of patients.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10066867
DOI: http://dx.doi.org/10.1080/01621459.2016.1240081
