Tile-Based Random Forest Analysis for Analyte Discovery in Balanced and Unbalanced GC × GC-TOFMS Data Sets.

Anal Chem

Organic and Biological Analytical Chemistry Group, Molecular Systems Research Unit, University of Liège, 4000 Liège, Belgium.

Published: September 2023

AI Article Synopsis

  • This study presents a new analysis method called tile-based RF analysis that combines the F-ratio analysis with the Random Forest machine learning algorithm, allowing for better evaluation of unbalanced data sets.
  • The RF method estimates the importance of each tile hit from the summed chromatographic signals and ranks the hits by importance; it was applied to a two-class comparison of stool samples from omnivore subjects stored under two conditions, liquid and lyophilized.
  • Both methods found similar class-distinguishing analytes, but the RF analysis returned a shorter hit list (31 and 29 hits versus 65 for the F-ratio analysis), restricted to hits with concentration ratios greater than 2 or less than 0.5.
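The idea of ranking tile hits by out-of-bag importance can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it swaps in bagged one-split decision stumps for a full Random Forest, uses permutation of out-of-bag samples as the importance measure, and runs on made-up signals for three hypothetical tiles (8 vs 9 samples). All function names and data are assumptions for illustration only.

```python
import random

def fit_stump(X, y, feature_ids):
    """Return (feature, threshold, left_label) with the fewest training errors."""
    best = None
    for j in feature_ids:
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for left_label in (0, 1):
                errors = sum(
                    (left_label if row[j] <= threshold else 1 - left_label) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, left_label)
    return None if best is None else best[1:]

def predict(stump, row):
    j, threshold, left_label = stump
    return left_label if row[j] <= threshold else 1 - left_label

def oob_importance(X, y, n_trees=200, seed=0):
    """Mean drop in out-of-bag accuracy when one feature is permuted."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    totals = [0.0] * p
    used = 0
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]   # samples left out
        feats = rng.sample(range(p), max(1, int(p ** 0.5)))  # random feature subset
        stump = fit_stump([X[i] for i in boot], [y[i] for i in boot], feats)
        if stump is None or not oob:
            continue
        used += 1
        base = sum(predict(stump, X[i]) == y[i] for i in oob) / len(oob)
        for j in range(p):
            shuffled = [X[i][j] for i in oob]
            rng.shuffle(shuffled)
            hits = sum(
                predict(stump, X[i][:j] + [v] + X[i][j + 1:]) == y[i]
                for i, v in zip(oob, shuffled)
            )
            totals[j] += base - hits / len(oob)
    return [t / max(used, 1) for t in totals]

# Hypothetical unbalanced data: 8 "Liq" vs 9 "Lyo" samples, three tiles.
# Tile 0 differs between classes; tiles 1 and 2 are noise.
data_rng = random.Random(1)
liq = [[data_rng.gauss(10, 1), data_rng.gauss(5, 1), data_rng.gauss(5, 1)] for _ in range(8)]
lyo = [[data_rng.gauss(4, 1), data_rng.gauss(5, 1), data_rng.gauss(5, 1)] for _ in range(9)]
X = liq + lyo
y = [0] * 8 + [1] * 9

importance = oob_importance(X, y)
# Rank tiles in descending order of estimated importance.
ranked = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
print(ranked, importance)
```

Nothing in the bootstrap or the out-of-bag evaluation requires equal class sizes, which is why this style of importance estimate carries over to unbalanced comparisons. A real analysis would rely on an established Random Forest implementation rather than this toy ensemble.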

Article Abstract

In this study, we introduce a new nontargeted tile-based supervised analysis method that combines the four-grid tiling scheme previously established for the Fisher ratio (F-ratio) analysis (FRA) with the estimation of tile hit importance using the machine learning (ML) algorithm Random Forest (RF). This approach is termed tile-based RF analysis. As opposed to the standard tile-based F-ratio analysis, the RF approach can be extended to the analysis of unbalanced data sets, i.e., data sets with different numbers of samples per class. Tile-based RF computes out-of-bag (oob) tile hit importance estimates for every summed chromatographic signal within each tile on a per-mass channel basis (m/z). These estimates are then used to rank tile hits in descending order of importance. In the present investigation, the RF approach was applied to a two-class comparison of stool samples collected from omnivore (O) subjects and stored using two different storage conditions: liquid (Liq) and lyophilized (Lyo). Two final hit lists were generated using balanced (8 vs 8 comparison) and unbalanced (8 vs 9 comparison) data sets and compared to the hit list generated by the standard F-ratio analysis. Similar class-distinguishing analytes (p < 0.01) were discovered by both methods. However, while the FRA discovered a more comprehensive hit list (65 hits), the RF approach strictly discovered hits (31 hits for the balanced data set comparison and 29 hits for the unbalanced data set comparison) with concentration ratios, [OLiq]/[OLyo], greater than 2 (or less than 0.5). This difference is attributed to the more stringent feature selection process used by the RF algorithm. Moreover, our findings suggest that the RF approach is a promising method for identifying class-distinguishing analytes in settings characterized by both high between-class variance and high within-class variance, making it an advantageous method in the study of complex biological matrices.
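As a rough illustration of the two quantities the abstract compares, the sketch below computes a textbook one-way F-ratio (between-class variance over within-class variance) and a concentration ratio for a single hypothetical tile, then applies the greater-than-2 or less-than-0.5 criterion. The signal values and function names are invented for the example, and the code is not the tile-based FRA software itself.

```python
from statistics import mean

def f_ratio(class_a, class_b):
    """Textbook one-way F-ratio: between-class over within-class variance."""
    groups = [class_a, class_b]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-class variance, with (number of classes - 1) degrees of freedom.
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (len(groups) - 1)
    # Within-class variance, with (samples - number of classes) degrees of freedom.
    within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (n_total - len(groups))
    return between / within

def concentration_ratio(class_a, class_b):
    """Ratio of mean signals, used here as a stand-in for [OLiq]/[OLyo]."""
    return mean(class_a) / mean(class_b)

def passes_ratio_filter(ratio, threshold=2.0):
    """True when the ratio is greater than 2 or less than 0.5 (for threshold=2)."""
    return ratio > threshold or ratio < 1.0 / threshold

# Hypothetical summed signals for one tile at one m/z (balanced, 4 vs 4).
o_liq = [10.0, 12.0, 11.0, 13.0]
o_lyo = [4.0, 5.0, 6.0, 5.0]

f_value = f_ratio(o_liq, o_lyo)
ratio = concentration_ratio(o_liq, o_lyo)
print(f_value, ratio, passes_ratio_filter(ratio))
```

A tile can have a high F-ratio and still fail the ratio filter when the class means are close but the within-class spread is very small, which is one way a longer F-ratio hit list and a shorter ratio-restricted list can coexist.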

Source
http://dx.doi.org/10.1021/acs.analchem.3c01872

Publication Analysis

Top Keywords

data sets (12)
f-ratio analysis (12)
random forest (8)
tile hit (8)
unbalanced data (8)
hit list (8)
class-distinguishing analytes (8)
data set (8)
set comparison (8)
analysis (7)