Optimizing high performance computing workflow for protein functional annotation.

Larissa Stanberry, Bhanu Rekepalli, Yuan Liu, Paul Giblock, Roger Higdon, Elizabeth Montague, William Broomall, Natali Kolker, Eugene Kolker

Concurr Comput

Bioinformatics & High-throughput Analysis Laboratory, SCRI, High-throughput Analysis Core, SCRI, Predictive Analytics, Seattle Children's Hospital, Departments of Pediatrics and Biomedical Informatics & Medical Education, University of Washington, DELSA Global.

Published: September 2014

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins to existing clusters of orthologous groups of proteins. Based on the Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST), the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
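To make the 80% sensitivity and specificity criterion concrete, the following is a minimal Python sketch of how those two metrics can be computed for membership in a single orthologous group. It is not the authors' implementation: the names `ConfusionCounts`, `tally`, `sensitivity`, `specificity`, and `meets_threshold`, as well as the toy outcomes, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class ConfusionCounts:
    """Counts of classification outcomes for one orthologous group."""
    tp: int = 0  # protein assigned to the group and truly a member
    fp: int = 0  # protein assigned to the group but not a member
    tn: int = 0  # protein not assigned and not a member
    fn: int = 0  # protein not assigned but truly a member


def tally(outcomes: Iterable[Tuple[bool, bool]]) -> ConfusionCounts:
    """Tally (predicted_member, actual_member) pairs into confusion counts."""
    counts = ConfusionCounts()
    for predicted, actual in outcomes:
        if predicted and actual:
            counts.tp += 1
        elif predicted and not actual:
            counts.fp += 1
        elif not predicted and actual:
            counts.fn += 1
        else:
            counts.tn += 1
    return counts


def sensitivity(c: ConfusionCounts) -> float:
    """Fraction of true members that were assigned to the group."""
    denom = c.tp + c.fn
    return c.tp / denom if denom else float("nan")


def specificity(c: ConfusionCounts) -> float:
    """Fraction of non-members that were correctly left out of the group."""
    denom = c.tn + c.fp
    return c.tn / denom if denom else float("nan")


def meets_threshold(c: ConfusionCounts, threshold: float = 0.80) -> bool:
    """Check both metrics against a threshold (0.80 mirrors the abstract)."""
    return sensitivity(c) >= threshold and specificity(c) >= threshold


# Toy data (hypothetical): 10 true members and 10 non-members.
outcomes = (
    [(True, True)] * 9 + [(False, True)] * 1
    + [(False, False)] * 8 + [(True, False)] * 2
)
counts = tally(outcomes)
print(sensitivity(counts), specificity(counts), meets_threshold(counts))
```

In this toy case, nine of ten true members are recovered and eight of ten non-members are rejected, so both metrics sit at or above the 80% level stated in the abstract.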

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4194055	PMC
http://dx.doi.org/10.1002/cpe.3264	DOI Listing

Publication Analysis

Top Keywords

newly sequenced

high performance

performance computing

functional annotation

protein sequence

sequence universe

sequenced bacterial

protein annotation

workflow

protein

Similar Publications

Transmitted HIV-1 Drug Resistance in Estonian Residents and Ukrainian Refugees in 2020 and 2022.

J Glob Antimicrob Resist

January 2025

Faculty of Medicine, Department of Microbiology, University of Tartu, Tartu, Estonia.

Arina Šablinskaja, Hiie Soeorg, Merit Pauskar, Ene-Ly Jõgeda, Heli Rajasaar

Objectives: We investigated the prevalence of drug resistance mutations (DRMs) in individuals newly diagnosed with HIV-1 in Estonia in 2020 and 2022, and in Ukrainian war refugees living with HIV who arrived in Estonia in 2022.

Methods: HIV-1 genomic RNA was sequenced in protease-reverse transcriptase and integrase regions. DRMs were determined separately by Stanford University CPR Tool and HIVdb Program.

A novel POT1-TPD presentation: A germline pathogenic POT1 variant discovered in a patient with newly diagnosed posterior fossa ependymoma.

Cancer Genet

January 2025

Cincinnati Children's Hospital Medical Center, Division of Oncology, Cincinnati, OH, USA; University of Cincinnati College of Medicine, Cincinnati, OH, USA.

Stephen Gilene, Sara Knapke, Daniel Leino, Somak Roy, Scott Raskin

Introduction: POT1 tumor predisposition (POT1-TPD) is an autosomal dominant disorder characterized by increased lifetime malignancy risk. Melanoma, angiosarcoma, and chronic lymphocytic leukemia are the most frequently reported malignancies [1]. Protection of telomeres protein 1 (POT1) is part of the shelterin protein complex to maintain/protect telomeres [2].

P3 site-directed mutagenesis: An efficient method based on primer pairs with 3'-overhangs.

J Biol Chem

January 2025

Rosalind and Morris Goodman Cancer Institute, McGill University, Montreal, Quebec H3A 1A3, Canada; Department of Medicine, McGill University, Montreal, Quebec H3A 1A3, Canada; Department of Biochemistry, McGill University, Montreal, Quebec H3A 1A3, Canada; McGill University Health Center, Montreal, Quebec H3A 1A3, Canada.

Negar Mousavi, Ethan Zhou, Arezousadat Razavi, Elham Ebrahimi, Paulina Varela-Castillo

Site-directed mutagenesis is a fundamental tool indispensable for protein and plasmid engineering. An important technological question is how to achieve the efficiency at the ideal level of 100%. Based on complementary primer pairs, the QuickChange method has been widely used, but it requires significant improvements due to its low efficiency and frequent unwanted mutations.

Unveiling triclosan biodegradation: Novel metabolic pathways, genomic insights, and global environmental adaptability of Pseudomonas sp. strain W03.

J Hazard Mater

January 2025

Marine Synthetic Ecology Research Center, Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), School of Marine Science, Sun Yat-sen University, Zhuhai 519080, China.

Lan Qiu, Xiaoyuan Guo, Hojae Shim, Tianwei Hao, Zhiwei Liang

The polychlorinated aromatic antimicrobial agent triclosan (TCS) is widely used to indiscriminately and rapidly kill microorganisms. The global use of TCS has led to widespread environmental contamination, posing significant threats to ecosystem and human health. Here we reported a newly isolated Pseudomonas sp.

Pilot work of the 10K Chinese People Genomic Diversity Project along the Silk Road suggests a complex east-west admixture landscape and biological adaptations.

Sci China Life Sci

January 2025

Institute of Rare Diseases, West China Hospital of Sichuan University, Sichuan University, Chengdu, 610000, China.

Guanglin He, Hongbing Yao, Shuhan Duan, Lintao Luo, Qiuxia Sun

Genomic sources from China are underrepresented in the population-specific reference database. We performed whole-genome sequencing or genome-wide genotyping on 1,207 individuals from four linguistically diverse groups (1,081 Sinitic, 56 Mongolic, 40 Turkic, and 30 Tibeto-Burman people) living in North China included in the 10K Chinese People Genomic Diversity Project (10K_CPGDP) to characterize the genetic architecture and adaptative history of ethnic groups in the Silk Road Region of China. We observed a population split between Northwest Chinese minorities (NWCMs) and Han Chinese since the Upper Paleolithic and later Neolithic genetic differentiation within NWCMs.
