Publications by authors named "Haft D"

The LPXTG protein-sorting signal, found in surface proteins of various Gram-positive pathogens, was the founding member of a growing panel of prokaryotic small C-terminal sorting domains. Sortase A cleaves LPXTG, exosortases (XrtA and XrtB) cleave the PEP-CTERM sorting signal, archaeosortase A cleaves PGF-CTERM, and rhombosortase cleaves GlyGly-CTERM domains. Four sorting signal domains without previously known processing proteases are the MYXO-CTERM, JDVT-CTERM, Synerg-CTERM, and CGP-CTERM domains.

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past 3 years, we have expanded the diversity of the RefSeq collection by including the best quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq.

Motivation: The release of AlphaFold 2.0 has revolutionized our ability to determine protein structures from sequences. This tool also inadvertently opens up many unanticipated opportunities.

The bioinformatics of a nine-gene locus, designated selenocysteine-assisted organometallic (SAO), was investigated after identifying six new selenoprotein families and constructing hidden Markov models (HMMs) that find and annotate members of those families. Four are selenoproteins in most SAO loci, including Clostridium difficile. They include two ABC transporter subunits, namely, permease SaoP, with selenocysteine (U) at the channel-gating position, and substrate-binding subunit SaoB.

Antimicrobial resistance (AMR) is a significant public health threat. Low-cost whole-genome sequencing, which is often used in surveillance programmes, provides an opportunity to assess AMR gene content in these genomes using in silico approaches. A variety of bioinformatic tools have been developed to identify these genomic elements.

Assigning names to β-lactamase variants has been inconsistent and has led to confusion in the published literature. The common availability of whole genome sequencing has resulted in an exponential growth in the number of new β-lactamase genes. In November 2021 an international group of β-lactamase experts met virtually to develop a consensus for the way naturally-occurring β-lactamase genes should be named.

Antimicrobial resistance (AMR) is a significant public health threat. With the rise of affordable whole genome sequencing, in silico approaches to assessing AMR gene content can be used to detect known resistance mechanisms and potentially identify novel mechanisms. To enable accurate assessment of AMR gene content, as part of a multi-agency collaboration, NCBI developed a comprehensive AMR gene database, the Bacterial Antimicrobial Resistance Reference Gene Database and the AMR gene detection tool AMRFinder.

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures.

The number and diversity of known CRISPR-Cas systems have substantially increased in recent years. Here, we provide an updated evolutionary classification of CRISPR-Cas systems and cas genes, with an emphasis on the major developments that have occurred since the publication of the latest classification, in 2015. The new classification includes 2 classes, 6 types and 33 subtypes, compared with 5 types and 16 subtypes in 2015.

Unlike for classes A and B, a standardized amino acid numbering scheme has not been proposed for the class C (AmpC) β-lactamases, which complicates communication in the field. Here, we propose a scheme developed through a collaborative approach that considers both sequence and structure, preserves traditional numbering of catalytically important residues (Ser64, Lys67, Tyr150, and Lys315), is adaptable to new variants or enzymes yet to be discovered and includes a variation for genetic and epidemiological applications.

Antimicrobial resistance (AMR) is a major public health problem that requires publicly available tools for rapid analysis. To identify AMR genes in whole-genome sequences, the National Center for Biotechnology Information (NCBI) has produced AMRFinder, a tool that identifies AMR genes using a high-quality curated AMR gene reference database. The Bacterial Antimicrobial Resistance Reference Gene Database consists of up-to-date gene nomenclature, a set of hidden Markov models (HMMs), and a curated protein family hierarchy.

The vitamin B12 family of cofactors known as cobamides are essential for a variety of microbial metabolisms. We used comparative genomics of 11,000 bacterial species to analyze the extent and distribution of cobamide production and use across bacteria. We find that 86% of bacteria in this data set have at least one of 15 cobamide-dependent enzyme families, but only 37% are predicted to synthesize cobamides de novo.

Automatic annotation of protein function is routinely applied to newly sequenced genomes. While this provides a fine-grained view of an organism's functional protein repertoire, proteins more commonly function in a coordinated manner, such as in pathways or multimeric complexes. Genome Properties (GPs) define such functional entities as a series of steps, originally described by either TIGRFAMs or Pfam entries.

The initial report of the mcr-1 (mobile colistin resistance) gene has led to many reports of mcr-1 variants and other mcr genes from different bacterial species originating from human, animal and environmental samples in different geographical locations. Resistance gene nomenclature is complex and unfortunately problems such as different names being used for the same gene/protein or the same name being used for different genes/proteins are not uncommon. Registries exist for some families, such as bla (β-lactamase) genes, but there is as yet no agreed nomenclature scheme for mcr genes.

Bacterial floc formation plays a central role in the activated sludge (AS) process, which has been widely utilized for sewage and wastewater treatment. The formation of AS flocs has long been known to require exopolysaccharide biosynthesis. This study demonstrates an additional requirement for a PEP-CTERM protein in Zoogloea resiniphila, a dominant AS bacterium harboring a large exopolysaccharide biosynthesis gene cluster.

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes.

In functionally diverse protein families, conservation in short signature regions may outperform full-length sequence comparisons for identifying proteins that belong to a subgroup within which one specific aspect of their function is conserved. The SIMBAL workflow (Sites Inferred by Metabolic Background Assertion Labeling) is a data-mining procedure for finding such signature regions. It begins by using clues from genomic context, such as co-occurrence or conserved gene neighborhoods, to build a useful training set from a large number of uncharacterized but mutually homologous proteins.

Article Synopsis
  • Mycobacterium tuberculosis (Mtb) survives in the acidic, reactive environment of macrophage phagosomes by utilizing dehydrogenases encoded in its genome, which may help it resist host defenses.
  • Mycobacterial short chain dehydrogenases/reductases (SDRs) possess a unique insertion at their NAD binding sites that prevents the typical exchange of NAD/NADH, suggesting a different mechanism for their function.
  • Experiments indicate these SDRs rely on external redox partners instead of cofactor exchange for their catalytic processes, and they are associated with the mftA gene and its corresponding product, which may play a role in this external redox partnership.

Colombia is one of the 105 countries that have reported at least one case of extensively drug-resistant tuberculosis (XDR-TB). The Mycobacterium tuberculosis Haarlem genotype is ubiquitous worldwide. Here, we report the high-quality draft genome sequence of a Colombian Haarlem XDR-TB clinical isolate composed of 4,329,127 bp with 4,386 genes.
