CoCoPyE: feature engineering for learning and prediction of genome quality indices.

Gigascience

Department of Applied Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077 Goettingen, Germany.

Published: January 2024

AI Article Synopsis

  • The study focuses on the advancements in reconstructing microbial genomes from metagenomic data and highlights the variability in data quality that needs assessment.
  • The authors introduced a new tool called CoCoPyE, which utilizes a two-stage process to accurately predict genome quality by identifying genomic markers and refining estimates with machine learning.
  • CoCoPyE outperformed existing tools in accuracy during simulations and provides both a web server and Python implementation for easy integration into genome analysis workflows.

Article Abstract

Background: The exploration of the microbial world has been greatly advanced by the reconstruction of genomes from metagenomic sequence data. However, the rapidly increasing number of metagenome-assembled genomes has also resulted in a wide variation in data quality. It is therefore essential to quantify the achieved completeness and possible contamination of a reconstructed genome before it is used in subsequent analyses. The classical approach for the estimation of quality indices solely relies on a relatively small number of universal single-copy genes. Recent tools try to extend the genomic coverage of estimates for an increased accuracy.
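
The classical marker-based estimate mentioned above can be illustrated with a minimal sketch. It assumes a simplified scheme in which completeness is the fraction of expected single-copy markers that are found and contamination is the number of surplus copies relative to the marker set size. The function name and the marker IDs are hypothetical and serve only as an illustration; this is not the procedure of any specific tool.

```python
from collections import Counter


def marker_based_quality(observed_markers, marker_set):
    """Estimate completeness and contamination from single-copy marker hits.

    observed_markers: iterable of marker IDs detected in the genome,
                      with repeats if a marker was found more than once.
    marker_set:       set of universal single-copy markers expected once each.
    """
    if not marker_set:
        raise ValueError("marker_set must not be empty")

    # Count only hits that belong to the expected marker set.
    counts = Counter(m for m in observed_markers if m in marker_set)

    # Completeness: share of expected markers seen at least once.
    completeness = len(counts) / len(marker_set)

    # Contamination: surplus copies of markers that should occur only once.
    extra_copies = sum(c - 1 for c in counts.values())
    contamination = extra_copies / len(marker_set)

    return completeness, contamination


# Hypothetical marker set and hits, for illustration only.
markers = {"rpsC", "rplB", "rpoB", "recA"}
hits = ["rpsC", "rplB", "rplB", "recA", "unrelated_gene"]

completeness, contamination = marker_based_quality(hits, markers)
print(completeness, contamination)  # 0.75 0.25
```

Because such an estimate rests on a small marker set, a single missing or duplicated marker shifts the result by a large step (here 25 percentage points), which is one reason for extending the genomic coverage of the estimates.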

Results: We developed CoCoPyE, a fast tool based on a novel 2-stage feature extraction and transformation scheme. First, it identifies genomic markers and then refines the marker-based estimates with a machine learning approach. In our simulation studies, CoCoPyE showed a more accurate prediction of quality indices than the existing tools. While the CoCoPyE web server offers an easy way to try out the tool, the freely available Python implementation enables integration into existing genome reconstruction pipelines.
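
The two-stage idea, a raw marker-based estimate followed by a learned refinement, can be sketched as follows. This is an illustration under strong assumptions and not CoCoPyE's implementation: the refinement here is a plain least-squares line fitted on hypothetical simulated genomes with known completeness, standing in for whatever features and model the tool actually uses. All function names and numbers are invented for the example.

```python
def fit_linear_correction(raw_estimates, true_values):
    """Fit true ~ slope * raw + intercept by ordinary least squares."""
    n = len(raw_estimates)
    if n < 2 or n != len(true_values):
        raise ValueError("need at least two paired training values")

    mean_x = sum(raw_estimates) / n
    mean_y = sum(true_values) / n
    var_x = sum((x - mean_x) ** 2 for x in raw_estimates)
    if var_x == 0:
        raise ValueError("raw estimates must not all be identical")

    cov_xy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(raw_estimates, true_values)
    )
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def refine(raw_estimate, slope, intercept):
    """Apply the learned correction and clamp the result to [0, 1]."""
    return min(1.0, max(0.0, slope * raw_estimate + intercept))


# Hypothetical training data: stage 1 estimates for simulated genomes
# and the completeness values they were simulated with.
raw_train = [0.2, 0.4, 0.6, 0.8]
true_train = [0.25, 0.45, 0.65, 0.85]

slope, intercept = fit_linear_correction(raw_train, true_train)

# Stage 2: refine a new stage 1 estimate with the fitted correction.
refined = refine(0.5, slope, intercept)
print(round(refined, 3))  # 0.55
```

Training on simulated genomes is what makes a learned second stage possible: the true quality indices are known by construction, so a systematic bias of the first-stage estimate can be measured and corrected.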

Conclusions: CoCoPyE provides a new approach to assess the quality of genome data. It complements and improves existing tools and may help researchers to better distinguish between low-quality draft and high-quality genome assemblies in metagenome sequencing projects.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11503480
DOI: http://dx.doi.org/10.1093/gigascience/giae079
