Tile-Based Random Forest Analysis for Analyte Discovery in Balanced and Unbalanced GC × GC-TOFMS Data Sets.

Meriem Gaida, Caitlin N Cain, Robert E Synovec, Jean-François Focant, Pierre-Hugues Stefanuto

Anal Chem

Organic and Biological Analytical Chemistry Group, Molecular Systems Research Unit, University of Liège, 4000 Liège, Belgium.

Published: September 2023

This study presents a new method, tile-based RF analysis, that combines the tiling scheme of F-ratio analysis with the Random Forest machine learning algorithm and, unlike standard F-ratio analysis, can be applied to unbalanced data sets.
The method estimates the importance of each tile hit from the summed chromatographic signal and ranks the hits accordingly. It was applied to a comparison of stool samples from omnivores stored under two different conditions.
Both RF analysis and standard F-ratio analysis found similar class-distinguishing analytes, but RF analysis returned a shorter hit list restricted to analytes with large concentration differences between the two classes.

In this study, we introduce a new nontargeted tile-based supervised analysis method that combines the four-grid tiling scheme previously established for the Fisher ratio (F-ratio) analysis (FRA) with the estimation of tile hit importance using the machine learning (ML) algorithm Random Forest (RF). This approach is termed tile-based RF analysis. As opposed to the standard tile-based F-ratio analysis, the RF approach can be extended to the analysis of unbalanced data sets, i.e., data sets with different numbers of samples per class. Tile-based RF computes out-of-bag (oob) tile hit importance estimates for every summed chromatographic signal within each tile on a per-mass channel basis (m/z). These estimates are then used to rank tile hits in descending order of importance. In the present investigation, the RF approach was applied to a two-class comparison of stool samples collected from omnivore (O) subjects and stored using two different storage conditions: liquid (Liq) and lyophilized (Lyo). Two final hit lists were generated using balanced (8 vs 8 comparison) and unbalanced (8 vs 9 comparison) data sets and compared to the hit list generated by the standard F-ratio analysis. Similar class-distinguishing analytes (p < 0.01) were discovered by both methods. However, while the FRA discovered a more comprehensive hit list (65 hits), the RF approach discovered only hits (31 hits for the balanced data set comparison and 29 hits for the unbalanced data set comparison) with concentration ratios, [OLiq]/[OLyo], greater than 2 (or less than 0.5). This difference is attributed to the more stringent feature selection process used by the RF algorithm. Moreover, our findings suggest that the RF approach is a promising method for identifying class-distinguishing analytes in settings characterized by both high between-class variance and high within-class variance, making it an advantageous method in the study of complex biological matrices.
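The ranking step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation, and it simplifies in several ways that should be read as assumptions: a single tile grid stands in for the four-grid scheme, a forest of one-split decision stumps stands in for a full Random Forest, the data are synthetic (one m/z channel, 8 vs 9 samples, with one tile deliberately elevated in class 0), and all function names (`sum_tiles`, `fit_stump`, `oob_importance`, `simulate`) are hypothetical. Importance is estimated as the drop in out-of-bag accuracy when the split tile's values are permuted.

```python
import random

def sum_tiles(chrom, tile_rows, tile_cols):
    """Sum a 2D signal (one m/z channel) over non-overlapping tiles, row-major."""
    n_rows, n_cols = len(chrom), len(chrom[0])
    sums = []
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            sums.append(sum(chrom[r][c]
                            for r in range(r0, min(r0 + tile_rows, n_rows))
                            for c in range(c0, min(c0 + tile_cols, n_cols))))
    return sums

def predict(value, thr, left):
    """Stump rule: label `left` at or below the threshold, the other label above."""
    return left if value <= thr else 1 - left

def fit_stump(X, y, rows, feats):
    """Pick the (tile, threshold, left label) with the fewest errors on `rows`."""
    ones = sum(y[i] for i in rows)
    majority = 1 if 2 * ones >= len(rows) else 0
    # Fallback: a trivial stump that always predicts the majority class.
    best = (min(ones, len(rows) - ones), feats[0], float("inf"), majority)
    for f in feats:
        vals = sorted({X[i][f] for i in rows})
        for lo, hi in zip(vals, vals[1:]):
            thr = (lo + hi) / 2
            for left in (0, 1):
                errs = sum(predict(X[i][f], thr, left) != y[i] for i in rows)
                if errs < best[0]:
                    best = (errs, f, thr, left)
    return best[1:]

def oob_importance(X, y, n_trees=200, seed=0):
    """Mean drop in out-of-bag accuracy after permuting each tile's values."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = max(1, int(p ** 0.5))  # tiles considered per stump
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue
        f, thr, left = fit_stump(X, y, boot, rng.sample(range(p), m))
        base = sum(predict(X[i][f], thr, left) == y[i] for i in oob)
        donors = oob[:]
        rng.shuffle(donors)  # permute tile f among the out-of-bag samples
        perm = sum(predict(X[j][f], thr, left) == y[i]
                   for i, j in zip(oob, donors))
        # A stump uses only tile f, so permuting any other tile changes nothing.
        importance[f] += (base - perm) / len(oob)
    return [v / n_trees for v in importance]

def simulate(label, rng):
    """Synthetic 4 x 6 signal; class 0 is elevated in rows 0-1, columns 4-5."""
    chrom = [[rng.random() for _ in range(6)] for _ in range(4)]
    if label == 0:
        for r in (0, 1):
            for c in (4, 5):
                chrom[r][c] += 5.0
    return chrom

rng = random.Random(1)
y = [0] * 8 + [1] * 9  # unbalanced: 8 vs 9 samples
X = [sum_tiles(simulate(label, rng), 2, 2) for label in y]
importance = oob_importance(X, y)
ranking = sorted(range(len(importance)), key=lambda t: importance[t], reverse=True)
print("tiles ranked by oob importance:", ranking)
```

With 2 x 2 tiles, the elevated region corresponds to tile index 2, so that tile should rank first. The 8 vs 9 class sizes need no special handling here because each stump is fit on a bootstrap sample and scored on its own out-of-bag samples.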

DOI: http://dx.doi.org/10.1021/acs.analchem.3c01872
