Many purported pseudogenes in bacterial genomes are bona fide genes.

BMC Genomics

Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.

Published: April 2024

Background: Microbial genomes are largely composed of protein-coding sequences, yet some genomes contain many pseudogenes caused by frameshifts or internal stop codons. These pseudogenes are believed to result from gene degradation during evolution but could also be technical artifacts of genome sequencing or assembly.

Results: Using a combination of observational and experimental data, we show that many putative pseudogenes are attributable to errors that are incorporated into genomes during assembly. Within 126,564 publicly available genomes, we observed that nearly identical genomes often differed substantially in pseudogene counts. Causal inference implicated assembler, sequencing platform, and coverage as likely causative factors. Reassembly of genomes from raw reads confirmed that each variable affects the number of putative pseudogenes in an assembly. Furthermore, simulated sequencing reads corroborated our observations that the quality and quantity of raw data can significantly impact the number of pseudogenes in an assembler-dependent fashion. The number of unexpected pseudogenes due to internal stops was highly correlated (R = 0.96) with average nucleotide identity to the ground truth genome, implying that relative pseudogene counts can be used as a proxy for overall assembly correctness. Applying our method to assemblies in RefSeq resulted in rejection of 3.6% of assemblies due to significantly elevated pseudogene counts. Reassembly from real reads obtained from high-coverage genomes showed considerable variability in spurious pseudogenes beyond that observed with simulated reads, reinforcing the finding that high coverage is necessary to mitigate assembly errors.

Conclusions: Collectively, these results demonstrate that many pseudogenes in microbial genome assemblies are actually genes. Our results suggest that high read coverage is required for correct assembly and that an inflated number of pseudogenes due to internal stops is a sign of poor overall assembly quality.
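
To make the notion of a "pseudogene due to internal stops" concrete, the following is a minimal sketch of how in-frame premature stop codons could be located in a single coding sequence. It is not the authors' method: the function name `internal_stop_positions` is hypothetical, and the stop codon set assumes the standard bacterial genetic code (TAA, TAG, TGA), which does not hold for organisms that reassign TGA.

```python
# Assumption: standard bacterial genetic code, where TAA, TAG and TGA are stops.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_positions(cds: str) -> list[int]:
    """Return 0-based nucleotide offsets of in-frame stop codons
    that occur before the final codon of a coding sequence.

    Hypothetical helper for illustration only.
    """
    seq = cds.upper()
    # Ignore trailing bases that do not form a complete codon.
    usable = len(seq) - len(seq) % 3
    # The final complete codon is expected to be the real stop, so skip it.
    last_codon_start = usable - 3
    return [
        i
        for i in range(0, last_codon_start, 3)
        if seq[i:i + 3] in STOP_CODONS
    ]


if __name__ == "__main__":
    intact = "ATGAAATAA"        # ATG AAA TAA: only the terminal stop
    disrupted = "ATGTAGAAATAA"  # ATG TAG AAA TAA: premature TAG at offset 3
    print(internal_stop_positions(intact))
    print(internal_stop_positions(disrupted))
```

A single miscalled base in an assembly can turn a sense codon into one of these stops, which is why a sequence flagged this way may reflect an assembly error rather than true gene degradation.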

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11017572
DOI: http://dx.doi.org/10.1186/s12864-024-10137-0

