Publications by authors named "Kimmen Sjolander"

Achieving high accuracy in orthology inference is essential for many comparative, evolutionary and functional genomic analyses, yet the true evolutionary history of genes is generally unknown and orthologs are used for very different applications across phyla, requiring different precision-recall trade-offs. As a result, it is difficult to assess the performance of orthology inference methods. Here, we present a community effort to establish standards and an automated web-based service to facilitate orthology benchmarking.

Experimental data exists for only a vanishingly small fraction of sequenced microbial genes. This community page discusses the progress made by the COMBREX project to address this important issue using both computational and experimental resources.

The PhyloFacts 'Fast Approximate Tree Classification' (FAT-CAT) web server provides a novel approach to ortholog identification using subtree hidden Markov model-based placement of protein sequences to phylogenomic orthology groups in the PhyloFacts database. Results on a data set of microbial, plant and animal proteins demonstrate FAT-CAT's high precision at separating orthologs and paralogs and robustness to promiscuous domains. We also present results documenting the precision of ortholog identification based on subtree hidden Markov model scoring.

The interface of protein structural biology, protein biophysics, molecular evolution, and molecular population genetics forms the foundation for a mechanistic understanding of many aspects of protein biochemistry. Current efforts in interdisciplinary protein modeling are in their infancy, and the state of the art of such models is described. Beyond the relationship between amino acid substitution and static protein structure, protein function, and corresponding organismal fitness, other considerations are also discussed.

Pyrroloquinoline quinone (PQQ) is a small, redox active molecule that serves as a cofactor for several bacterial dehydrogenases, introducing pathways for carbon utilization that confer a growth advantage. Early studies had implicated a ribosomally translated peptide as the substrate for PQQ production. This study presents a sequence- and structure-based analysis of the components of the pqq operon.

Ortholog identification is used in gene functional annotation, species phylogeny estimation, phylogenetic profile construction and many other analyses. Bioinformatics methods for ortholog identification are commonly based on pairwise protein sequence comparisons between whole genomes. Phylogenetic methods of ortholog identification have also been developed; these methods can be applied to protein data sets that share a common domain architecture or that share a single functional domain but differ outside this region of homology.

ModBase (http://salilab.org/modbase) is a database of annotated comparative protein structure models. The models are calculated by ModPipe, an automated modeling pipeline that relies primarily on Modeller for fold assignment, sequence-structure alignment, model building and model assessment (http://salilab.

A significant fraction of a plant's nuclear genome encodes chloroplast-targeted proteins, many of which are devoted to the assembly and function of the photosynthetic apparatus. Using digital video imaging of chlorophyll fluorescence, we isolated proton gradient regulation 7 (pgr7) as an Arabidopsis thaliana mutant with low nonphotochemical quenching of chlorophyll fluorescence (NPQ). In pgr7, the xanthophyll cycle and the PSBS gene product, previously identified NPQ factors, were still functional, but the efficiency of photosynthetic electron transport was lower than in the wild type.

We present the jump-start simultaneous alignment and tree construction using hidden Markov models (SATCHMO-JS) web server for simultaneous estimation of protein multiple sequence alignments (MSAs) and phylogenetic trees. The server takes as input a set of sequences in FASTA format, and outputs a phylogenetic tree and MSA; these can be viewed online or downloaded from the website. SATCHMO-JS is an extension of the SATCHMO algorithm, and employs a divide-and-conquer strategy to jump-start SATCHMO at a higher point in the phylogenetic tree, reducing the computational complexity of the progressive all-versus-all HMM-HMM scoring and alignment.

Motivation: The identification of catalytic residues is a key step in understanding the function of enzymes. While a variety of computational methods have been developed for this task, accuracies have remained fairly low. The best existing method exploits information from sequence and structure to achieve a precision (the fraction of predicted catalytic residues that are catalytic) of 18.
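
The precision measure defined above, together with its usual companion recall, can be sketched in a few lines. This is only an illustration of the metrics, not the evaluation code used in the paper; the function name `precision_recall` and the residue numbers in the usage lines are hypothetical.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a set of predicted catalytic residues.

    precision: fraction of predicted residues that are truly catalytic.
    recall:    fraction of truly catalytic residues that were predicted.
    Both default to 0.0 when their denominator would be zero.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical residue positions, chosen only for illustration.
predicted_sites = {10, 57, 102, 195}
known_sites = {57, 102, 221}
precision, recall = precision_recall(predicted_sites, known_sites)
```

Here two of the four predictions are correct, so precision is 2/4, and two of the three known sites are recovered, so recall is 2/3.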

Background: Identifying the catalytic residues in enzymes can aid in understanding the molecular basis of an enzyme's function and has significant implications for designing new drugs, identifying genetic disorders, and engineering proteins with novel functions. Since experimentally determining catalytic sites is expensive, better computational methods for identifying catalytic residues are needed.

Results: We propose ResBoost, a new computational method to learn characteristics of catalytic residues.

We present the INTREPID web server for predicting functionally important residues in proteins. INTREPID has been shown to boost the recall and precision of catalytic residue prediction over other sequence-based methods and can be used to identify other types of functional residues. The web server takes an input protein sequence, gathers homologs, constructs a multiple sequence alignment and phylogenetic tree and finally runs the INTREPID method to assign a score to each position.

Ortholog detection is essential in functional annotation of genomes, with applications to phylogenetic tree construction, prediction of protein-protein interaction and other bioinformatics tasks. We present here the PHOG web server employing a novel algorithm to identify orthologs based on phylogenetic analysis. Results on a benchmark dataset from the TreeFam-A manually curated orthology database show that PHOG provides a combination of high recall and precision competitive with both InParanoid and OrthoMCL, and allows users to target different taxonomic distances and precision levels through the use of tree-distance thresholds.

Motivation: Identification of functionally important residues in proteins plays a significant role in biological discovery. Here, we present INTREPID, an information-theoretic approach for functional site identification that exploits the information in large diverse multiple sequence alignments (MSAs). INTREPID uses a traversal of the phylogeny in combination with a positional conservation score, based on Jensen-Shannon divergence, to rank positions in an MSA.
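
As a rough illustration of a Jensen-Shannon-based conservation score, the sketch below scores a single alignment column against a background distribution. It is not the INTREPID implementation and omits the phylogeny traversal entirely; the uniform background, the pseudocount, the four-letter toy alphabet and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def column_distribution(column, alphabet, pseudocount=1e-6):
    """Estimate residue frequencies for one MSA column (gaps are ignored)."""
    counts = Counter(c for c in column if c in alphabet)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between p and q; ranges from 0 to 1 in bits."""
    m = {a: 0.5 * (p[a] + q[a]) for a in p}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def conservation_score(column, alphabet):
    """Score a column by its divergence from an assumed uniform background."""
    background = {a: 1.0 / len(alphabet) for a in alphabet}
    return jensen_shannon_divergence(
        column_distribution(column, alphabet), background
    )


# Toy alphabet and columns, for illustration only.
toy_alphabet = "ACDE"
conserved = conservation_score("AAAA", toy_alphabet)
variable = conservation_score("ACDE", toy_alphabet)
```

A fully conserved column diverges strongly from the background and scores high, while a column that matches the background scores close to zero, which is the property that makes the divergence usable for ranking positions.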

Prediction of molecular function of proteins has become an important task in the genomics era. A wide variety of sequence analysis tools are available to biologists for this task. We have selected one or two primary protocols for tasks such as domain detection, subcellular localization, and motif detection.

With the explosion in sequence data, accurate prediction of protein function has become a vital task in prioritizing experimental investigation. While computationally efficient methods for homology-based function prediction have been developed to make this approach feasible in high-throughput mode, it is not without its dangers. Biological processes such as gene duplication, domain shuffling, and speciation produce families of related genes whose gene products can have vastly different molecular functions.

The Generation Challenge Programme (GCP; www.generationcp.org) has developed an online resource documenting stress-responsive genes comparatively across plant species.

Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application.

Phylogenomic analysis addresses the limitations of function prediction based on annotation transfer, and has been shown to enable the highest accuracy in prediction of protein molecular function. The Berkeley Phylogenomics Group provides a series of web servers for phylogenomic analysis: classification of sequences to pre-computed families and subfamilies using the PhyloFacts Phylogenomic Encyclopedia, FlowerPower clustering of proteins sharing the same domain architecture, MUSCLE multiple sequence alignment, SATCHMO simultaneous alignment and tree construction and SCI-PHY subfamily identification. The PhyloBuilder web server provides an integrated phylogenomic pipeline starting with a user-supplied protein sequence, proceeding to homolog identification, multiple alignment, phylogenetic tree construction, subfamily identification and structure prediction.

Background: Function prediction by transfer of annotation from the top database hit in a homology search has been shown to be prone to systematic error. Phylogenomic analysis reduces these errors by inferring protein function within the evolutionary context of the entire family. However, accuracy of function prediction for multi-domain proteins depends on all members having the same overall domain structure.

The Berkeley Phylogenomics Group presents PhyloFacts, a structural phylogenomic encyclopedia containing almost 10,000 'books' for protein families and domains, with pre-calculated structural, functional and evolutionary analyses. PhyloFacts enables biologists to avoid the systematic errors associated with function prediction by homology through the integration of a variety of experimental data and bioinformatics methods in an evolutionary framework. Users can submit sequences for classification to families and functional subfamilies.

In the eight years since phylogenomics was introduced as the intersection of genomics and phylogenetics, the field has provided fundamental insights into gene function, genome history and organismal relationships. The utility of phylogenomics is growing with the increase in the number and diversity of taxa for which whole genome and large transcriptome sequence sets are being generated. We assert that the synergy between genomic and phylogenetic perspectives in comparative biology would be enhanced by the development and refinement of minimal reporting standards for phylogenetic analyses.
