Microbiome-based classification models for fresh produce safety and quality evaluation.

Microbiol Spectr

Department of Molecular and Cellular Biology, University of California Davis, Davis, California, USA.

Published: April 2024

Abstract: Small sample sizes and the loss of sequencing reads during microbiome data preprocessing can limit the statistical power to differentiate fresh produce phenotypes and can prevent the detection of important bacterial species associated with produce contamination or quality reduction. Here, we explored a machine learning-based k-mer hash analysis strategy to identify DNA signatures predictive of produce safety (PS) and produce quality (PQ) and compared it against the amplicon sequence variant (ASV) strategy, which uses a typical denoising step, and the ASV-based taxonomy strategy. Random forest-based classifiers for PS and PQ using 7-mer hash data sets had significantly higher classification accuracy than those using the ASV data sets. We also demonstrated that integrating multiple data sets and leveraging a 7-mer hash strategy leads to better classification performance for PS and PQ than the ASV method but yields lower PS classification accuracy than the feature-selected ASV-based taxonomy strategy. Because the 7-mer hash strategy cannot currently generate taxonomy, the ASV-based taxonomy strategy, which requires remarkably less computing time and memory, is more efficient for PS and PQ classification and is applicable to the identification of important taxa. These results lay the foundation for future studies that need to incorporate or compare different microbiome sequencing data sets when applying machine learning to the microbial safety and quality of food.
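The k-mer hash idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `kmer_hash_counts`, the choice of BLAKE2b as the hash, and the rule of skipping k-mers with ambiguous bases are all assumptions made for illustration. It slides a window of length k over each read, hashes every k-mer to a fixed-length identifier, and counts occurrences per sample, so that no denoising step discards reads before features are built.

```python
import hashlib
from collections import Counter


def kmer_hash_counts(reads, k=7):
    """Count hashed k-mers over all reads of one sample (illustrative only).

    Returns a Counter mapping a 16-character hex hash ID to its count.
    The hash function and filtering rule are assumptions, not the paper's.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) - set("ACGT"):
                continue
            # The same k-mer always maps to the same ID, in any data set.
            hash_id = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).hexdigest()
            counts[hash_id] += 1
    return counts


sample_counts = kmer_hash_counts(["ACGTACGTAC", "ACGTNCGTAC"])
```

Because the identifier depends only on the k-mer sequence, feature tables built this way share a feature space across studies, which is what makes the integration described in the abstract possible.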

Importance: Identification of generalizable indicators for produce safety (PS) and produce quality (PQ) improves the detection of produce contamination and quality decline. However, the loss of effective sequencing reads during microbiome data preprocessing and the limited sample size of individual studies restrict the statistical power to identify important features that differentiate PS and PQ phenotypes. We applied machine learning-based models using individual and integrated k-mer hash and amplicon sequence variant (ASV) data sets for PS and PQ classification and evaluated their classification performance. Random forest (RF)-based models using integrated 7-mer hash data sets achieved significantly higher PS and PQ classification accuracy. Because taxonomic analysis is limited for the 7-mer hash, we also developed RF-based models using feature-selected ASV-based taxonomic data sets, which achieved better PS classification than those using the integrated 7-mer hash data set. The RF feature selection method identified 480 PS indicators and 263 PQ indicators with a positive contribution to the PS and PQ classification.
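Integrating data sets from several studies amounts to placing every sample on one shared feature axis. The sketch below shows one way this could be done, under stated assumptions: the function name `integrate_tables`, the nested-dictionary input layout, and the conversion of counts to relative abundances are hypothetical choices for illustration and are not taken from the paper.

```python
def integrate_tables(tables):
    """Merge per-study feature tables into one matrix (illustrative only).

    tables: {study_id: {sample_id: {feature_id: count}}}
    Returns (sample_ids, feature_ids, matrix), where each matrix row holds
    relative abundances over the union of features seen in any study.
    """
    # Union of all feature IDs, sorted for a stable column order.
    feature_ids = sorted(
        {f for samples in tables.values() for counts in samples.values() for f in counts}
    )
    sample_ids, matrix = [], []
    for study_id, samples in sorted(tables.items()):
        for sample_id, counts in sorted(samples.items()):
            total = sum(counts.values())
            sample_ids.append(f"{study_id}:{sample_id}")
            # Features absent from a sample are filled with zero.
            matrix.append(
                [counts.get(f, 0) / total if total else 0.0 for f in feature_ids]
            )
    return sample_ids, feature_ids, matrix


ids, feats, mat = integrate_tables(
    {"studyA": {"s1": {"x": 2, "y": 2}}, "studyB": {"s1": {"y": 1, "z": 3}}}
)
```

The resulting matrix, paired with PS or PQ labels, is the kind of input a random forest classifier would take; the classifier itself is omitted here.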


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10986475
DOI: http://dx.doi.org/10.1128/spectrum.03448-23


Similar Publications

Deep learning is a double-edged sword. The powerful feature learning ability of deep models can effectively improve classification accuracy. Still, when the training samples for each class are limited, it will not only face the problem of overfitting but also significantly affect the classification result.


The cabbage aphid, Brevicoryne brassicae, is a major pest on Brassicaceae plants, causing significant yield losses annually. However, the lack of genomic resources has hindered progress in understanding this pest at the molecular level. Here, we present a high-quality, chromosomal-level genome assembly for B.


A simple model for the analysis of epidemics based on hospitalization data.

Math Biosci

January 2025

Department of Mathematics, University of Illinois Urbana-Champaign, Urbana, IL, USA; Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL, USA.

An epidemiological model with a minimal number of parameters is introduced, and its structural and practical identifiability is investigated both analytically and numerically. The model is useful when a high percentage of unreported cases is suspected; hence, only hospitalization data are used to fit the model parameters and calculate the basic reproductive number R0 and the effective reproductive number R. As a case study, the model is used to study the initial surge and the Omicron wave of the COVID-19 epidemic in Belgium.


Background: In data-sparse areas such as health care, computer scientists aim to leverage as much available information as possible to increase the accuracy of their machine learning models' outputs. As a standard, categorical data, such as patients' gender, socioeconomic status, or skin color, are used to train models in fusion with other data types, such as medical images and text-based medical information. However, the effects of including categorical data features for model training in such data-scarce areas are underexamined, particularly regarding models intended to serve individuals equitably in a diverse population.


Identification of potential drug-target interactions (DTIs) is a crucial step in drug discovery and repurposing. Although deep learning effectively deciphers DTIs, most deep learning-based methods represent drug features from only a single perspective. Moreover, the fusion method of drug and protein features needs further refinement.

