This analysis takes an in-depth look into the difficulties encountered by automatic methods for domain decomposition from three-dimensional structure. The analysis involves a multi-faceted set of criteria including the integrity of secondary structure elements, the tendency toward fragmentation of domains, domain boundary consistency and topology. The strength of the analysis comes from the use of a new comprehensive benchmark dataset, which is based on consensus among experts (CATH, SCOP and AUTHORS of the 3D structures) and covers 30 distinct architectures and 211 distinct topologies as defined by CATH.
The database of multiple alignments for protein structures (DMAPS) provides instant access to pre-computed multiple structure alignments for all protein structure families in the Protein Data Bank (PDB). Protein structure families have been obtained from four distinct classification methods including SCOP, CATH, ENZYME and CE, and multiple structure alignments have been built for all families containing at least three members, using CE-MC software. Currently, multiple structure alignments are available for 3050 SCOP-, 3087 CATH-, 664 ENZYME- and 1707 CE-based families.
A new scoring function for assessing the statistical significance of protein structure alignment has been developed. The new scores were tested empirically using the combinatorial extension (CE) algorithm. Each score was assigned a p-value by curve-fitting the distribution of the scores generated by a random comparison of proteins taken from the PDB_SELECT database and the structural classification of proteins (SCOP) database.
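As a rough illustration of assigning a p-value by curve-fitting a score distribution, the sketch below fits a Gumbel (extreme value) distribution to scores from random comparisons using the method of moments. The abstract does not say which distribution was fitted, so the Gumbel form, the function names `fit_gumbel` and `p_value`, and the example scores are all assumptions made for illustration.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel(scores):
    """Method-of-moments fit of a Gumbel distribution (an assumed model).

    Returns (mu, beta): the location and scale parameters.
    Needs at least two scores so the standard deviation is defined.
    """
    mean = statistics.fmean(scores)
    stdev = statistics.stdev(scores)
    beta = stdev * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta


def p_value(score, mu, beta):
    """Probability of a random score >= `score` under the fitted model."""
    z = (score - mu) / beta
    if z < -700:
        # exp(-z) would overflow; the upper-tail probability is 1 here.
        return 1.0
    # 1 - exp(-exp(-z)), written with expm1 for accuracy in the far tail.
    return -math.expm1(-math.exp(-z))


if __name__ == "__main__":
    # Hypothetical scores from random (unrelated) structure comparisons.
    background = [2.1, 2.4, 1.9, 3.0, 2.7, 2.2, 2.5, 3.3, 2.0, 2.8]
    mu, beta = fit_gumbel(background)
    print(p_value(4.5, mu, beta))
```

A small p-value suggests the observed score would rarely arise from a random comparison. How well this works depends on whether the assumed distribution actually fits the background scores.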
Accompanying the discovery of an increasing number of proteins, there is the need to provide functional annotation that is both highly accurate and consistent. The Gene Ontology (GO) provides consistent annotation in a computer readable and usable form; hence, GO annotation (GOA) has been assigned to a large number of protein sequences based on direct experimental evidence and through inference determined by sequence homology. Here we show that this annotation can be extended and corrected for cases where protein structures are available.
The CE-MC server (http://cemc.sdsc.edu) provides a web-based facility for the alignment of multiple protein structures based on C-alpha coordinate distances, using combinatorial extension (CE) and Monte Carlo (MC) optimization methods.
The assignment of protein domains from three-dimensional structure is critically important in understanding protein evolution and function, yet little quality assurance has been performed. Here, the differences in the assignment of structural domains are evaluated using six common assignment methods. Three human expert methods (AUTHORS (authors' annotation), CATH and SCOP) and three fully automated methods (DALI, DomainParser and PDP) are investigated by analysis of individual methods against the authors' assignment as well as analysis based on the consensus among groups of methods (only expert, only automatic, combined).
Motivation: Analysis of large biological data sets using a variety of parallel processor computer architectures is a common task in bioinformatics. The efficiency of the analysis can be significantly improved by properly handling redundancy present in these data combined with taking advantage of the unique features of these compute architectures.
Results: We describe a generalized approach to this analysis, but present specific results using the program CEPAR, an efficient implementation of the Combinatorial Extension algorithm in a massively parallel (PAR) mode for finding pairwise protein structure similarities and aligning protein structures from the Protein Data Bank.
Structural genomics (large-scale macromolecular 3-dimensional structure determination) is unique in that major participants report scientific progress on a weekly basis. The target database (TargetDB) maintained by the Protein Data Bank (http://targetdb.pdb.
Using an integrative genome annotation pipeline (iGAP) for proteome-wide protein structure and functional domain assignment, we analyzed all the proteins of Arabidopsis thaliana. Three-dimensional structures at the level of the domain are assigned by fold recognition and threading based on a novel fold library that extends common domain classifications. iGAP is being applied to proteins from all available proteomes as part of a comparative proteomics resource.
We have developed a program for automatic identification of domains in protein three-dimensional structures. Performance of the program was assessed by three different benchmarks: (i) by comparison with the expert-curated SCOP database of structural domains; (ii) by comparison with a collection of manual domain assignments; and (iii) by comparison with a set of 55 proteins, frequently used as a benchmark for automatic domain assignment. In all these benchmarks PDP identified domains correctly in more than 80% of proteins.
Intensive growth in 3D structure data on DNA-protein complexes as reflected in the Protein Data Bank (PDB) demands new approaches to the annotation and characterization of these data and will lead to a new understanding of critical biological processes involving these data. These data and those from other protein structure classifications will become increasingly important for the modeling of complete proteomes. We propose a fully automated classification of DNA-binding protein domains based on existing 3D-structures from the PDB.
The Conserved Key Amino Acid Positions DataBase (CKAAPs DB) provides access to an analysis of structurally similar proteins with dissimilar sequences where key residues within a common fold are identified. CKAAPs may be important in protein folding and structural stability and function, and hence useful for protein engineering studies. This paper provides an update to the initial report of CKAAPs DB [Li et al.
We have developed a new algorithm for the alignment of multiple protein structures based on a Monte Carlo optimization technique. The algorithm uses pair-wise structural alignments as a starting point. Four different types of moves were designed to generate random changes in the alignment.
An all-against-all protein structure comparison using the Combinatorial Extension (CE) algorithm applied to a representative set of PDB structures revealed a gallery of common substructures in proteins (http://cl.sdsc.edu/ce.
Comparisons of protein sequence via cyclic training of Hidden Markov Models (HMMs) in conjunction with alignments of three-dimensional structure, using the Combinatorial Extension (CE) algorithm, reveal two putative EF-hand metal binding domains in acetylcholinesterase. Based on sequence similarity, putative EF-hands are also predicted for the neuroligin family of cell surface proteins. These predictions are supported by experimental evidence.
Comparing and subsequently classifying protein structure information has received significant attention concurrent with the increase in the number of experimentally derived 3-dimensional structures. Classification schemes have focused on biological function found within protein domains and on structure classification based on topology. Here an alternative view is presented that groups substructures.
The Conserved Key Amino Acid Positions DataBase (CKAAPs DB) provides access to an analysis of structurally similar proteins with dissimilar sequences where key residues within a common fold are identified. The derivation and significance of CKAAPs starting from pairwise structure alignments is described fully in Reddy et al. [Reddy,B.
The database reported here is derived using the Combinatorial Extension (CE) algorithm, which compares pairs of protein polypeptide chains and provides a list of structurally similar proteins along with their structure alignments. Using CE, structure-structure alignments can provide insights into biological function. When a protein of known function is shown to be structurally similar to a protein of unknown function, a relationship might be inferred, one that is not necessarily detectable from sequence comparison alone.
The Protein Data Bank (PDB; http://www.rcsb.org/pdb/) is the single worldwide archive of structural data of biological macromolecules.
Acta Crystallogr D Biol Crystallogr
November 1998
Databases containing macromolecular structure data provide a crystallographer with important tools for use in solving, refining and understanding the functional significance of their protein structures. Given this importance, this paper briefly summarizes past progress by outlining the features of the significant number of relevant databases developed to date. One recent database, PDB+, containing all current and obsolete structures deposited with the Protein Data Bank (PDB) is discussed in more detail.
A new algorithm is reported which builds an alignment between two protein structures. The algorithm involves a combinatorial extension (CE) of an alignment path defined by aligned fragment pairs (AFPs) rather than the more conventional techniques using dynamic programming and Monte Carlo optimization. AFPs, as the name suggests, are pairs of fragments, one from each protein, which confer structure similarity.
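The idea of an aligned fragment pair can be sketched as follows: take one fragment from each protein and compare their internal C-alpha distance patterns. This is a simplified illustration only. The fragment length `m=8`, the `cutoff` value, the averaging scheme and the function names are assumptions rather than the published similarity measures, and the path-extension step that gives CE its name is not shown.

```python
import math


def fragment_difference(a, b, i, j, m):
    """Mean absolute difference between the intra-fragment distance
    patterns of a[i:i+m] and b[j:j+m].

    `a` and `b` are lists of (x, y, z) C-alpha coordinates; m must be >= 2.
    Comparing distances avoids having to superimpose the fragments.
    """
    total = 0.0
    pairs = 0
    for k in range(m):
        for l in range(k + 1, m):
            d_a = math.dist(a[i + k], a[i + l])
            d_b = math.dist(b[j + k], b[j + l])
            total += abs(d_a - d_b)
            pairs += 1
    return total / pairs


def candidate_afps(a, b, m=8, cutoff=1.0):
    """List (i, j) start positions of fragment pairs whose distance
    patterns differ by at most `cutoff` (same units as the coordinates)."""
    return [
        (i, j)
        for i in range(len(a) - m + 1)
        for j in range(len(b) - m + 1)
        if fragment_difference(a, b, i, j, m) <= cutoff
    ]
```

An alignment method built on this would then need to chain compatible pairs into a longer path; this sketch stops at enumerating the candidates.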
Comput Appl Biosci
October 1997
Motivation: To provide data management tools to maintain and query efficiently experimental and derived protein data with the goal of providing new insights into structure-function relationships. The tools should be portable, extensible, and accessible locally, or via the World Wide Web, providing data that would not otherwise be available.
Results: The initial phase of the work, the data representation and query of all available macromolecular structure data, including real-time access to complex property patterns based on the amino acid sequence, is reported.
Proc Int Conf Intell Syst Mol Biol
December 1995
A computer tool has been developed for revealing sets of oligonucleotides invariant for isofunctional families of DNA (RNA) and for using these in functional identification of nucleotide sequences. The tool allows one to: build up vocabularies of invariant oligonucleotides for the families of isofunctional nucleotide sequences; assess significance of the vocabularies; identify nucleotide sequences with the vocabularies of invariant oligonucleotides; determine the most effective identification parameters to minimize type I and type II errors; assess the efficiency of identification of individual isofunctional families with the oligonucleotide vocabularies; determine the evolutionary characteristics of the families of isofunctional sequences on which vocabulary volume depends. Using this system, we analyzed a total of 322 protein-encoding gene families and built sets of invariant oligonucleotides (oligonucleotide vocabularies) that are characteristic of gene families and subfamilies.
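One simple reading of an invariant-oligonucleotide vocabulary is the set of k-mers that occur in every sequence of a family. The sketch below builds such a vocabulary and uses it to test whether a new sequence matches. The exact-match criterion, the `min_fraction` threshold and the function names are assumptions for illustration, not a description of the original tool.

```python
def kmers(seq, k):
    """All distinct length-k substrings (oligonucleotides) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def invariant_vocabulary(family, k):
    """k-mers present in every sequence of the family."""
    sets = [kmers(s.upper(), k) for s in family]
    return set.intersection(*sets) if sets else set()


def matches_family(seq, vocabulary, k, min_fraction=1.0):
    """True if seq contains at least min_fraction of the vocabulary."""
    if not vocabulary:
        return False
    found = kmers(seq.upper(), k) & vocabulary
    return len(found) / len(vocabulary) >= min_fraction


if __name__ == "__main__":
    # Hypothetical toy family; real families would be far longer.
    family = ["ATGCGT", "TTATGC", "ATGCAA"]
    vocab = invariant_vocabulary(family, 3)
    print(sorted(vocab), matches_family("GGATGCC", vocab, 3))
```

Lowering `min_fraction` trades missed family members for more false matches, which is the kind of parameter choice the abstract describes tuning to minimize identification errors.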