Detecting and correcting misclassified sequences in the large-scale public databases.

Bioinformatics

Department of Computer Science, Ames, IA 50011, USA.

Published: September 2020

Motivation: As the cost of sequencing decreases, the amount of data deposited into public repositories is increasing rapidly. Public databases rely on submitters to provide the metadata for each submission, and this metadata is prone to user error. Unfortunately, most public databases, such as the non-redundant (NR) database, have no methods for identifying errors in the provided metadata, which allows errors to propagate. Previous research analyzed misclassification in a small subset of the NR database based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates the provenance and frequency of each annotation from manually and computationally created databases, together with clustering information at 95% similarity.
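The idea of combining annotation frequency and provenance within a sequence cluster can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes a cluster is already given as a list of `(protein_id, taxon, provenance)` tuples, and the names `PROVENANCE_WEIGHT`, `consensus_taxon`, `flag_misclassified` and the `min_share` threshold are hypothetical choices made for illustration only.

```python
from collections import defaultdict

# Hypothetical provenance weights (an assumption, not from the paper):
# manually curated annotations count more than computational ones.
PROVENANCE_WEIGHT = {"manual": 2.0, "computational": 1.0}


def consensus_taxon(annotations):
    """Return (taxon, share) for the highest provenance-weighted taxon.

    annotations: non-empty list of (protein_id, taxon, provenance) tuples
    describing the members of one sequence cluster.
    """
    scores = defaultdict(float)
    for _protein_id, taxon, provenance in annotations:
        scores[taxon] += PROVENANCE_WEIGHT.get(provenance, 1.0)
    total = sum(scores.values())
    # Iterate taxa in sorted order so that ties are broken alphabetically.
    best = max(sorted(scores), key=lambda taxon: scores[taxon])
    return best, scores[best] / total


def flag_misclassified(annotations, min_share=0.6):
    """List protein IDs whose taxon disagrees with a clear cluster consensus.

    Nothing is flagged when the cluster is empty or when the consensus
    share is below min_share (a hypothetical threshold).
    """
    if not annotations:
        return []
    best, share = consensus_taxon(annotations)
    if share < min_share:
        return []
    return [pid for pid, taxon, _prov in annotations if taxon != best]


cluster = [
    ("p1", "Escherichia coli", "manual"),
    ("p2", "Escherichia coli", "computational"),
    ("p3", "Escherichia coli", "computational"),
    ("p4", "Homo sapiens", "computational"),
]
print(flag_misclassified(cluster))
```

In this toy cluster, the weighted consensus is Escherichia coli, so only `p4` is flagged as a candidate for re-examination. A real pipeline would also need the taxonomic hierarchy, since two different labels may simply be a parent and a child in the same lineage.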

Results: We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases.
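For readers unfamiliar with the two metrics reported above, the following minimal sketch shows how precision and recall are computed from detection counts on simulated data. The function name `precision_recall` and the example counts are invented for illustration and are not taken from the paper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute (precision, recall) from raw detection counts.

    precision = TP / (TP + FP): fraction of flagged proteins that are
                truly misclassified.
    recall    = TP / (TP + FN): fraction of truly misclassified proteins
                that were flagged.
    A metric is reported as 0.0 when its denominator is zero.
    """
    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Illustrative counts only: 8 correct flags, 2 false alarms, 2 misses.
precision, recall = precision_recall(8, 2, 2)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

High precision with somewhat lower recall, as reported in the abstract, means that flagged proteins are usually real errors while some misclassified proteins remain undetected.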

Availability and Implementation: Source code, dataset, documentation, Jupyter notebooks and Docker container are available at https://github.com/boalang/nr.

Supplementary Information: Supplementary data are available at Bioinformatics online.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7821992
DOI: http://dx.doi.org/10.1093/bioinformatics/btaa586

