Publications by authors named "Kanehisa M"

We present here a heuristic method for predicting expression specificity in the transcriptional process, which is known to be regulated in large part by promoter sequences. The method observes the appearance of conserved sequence patterns in a group of known promoters, such as those of housekeeping or tissue-specific genes. Statistically conserved patterns were automatically extracted from a set of unaligned sequences up to 200 bp upstream of the transcription initiation site, by a standard procedure using Markov chain and binomial distribution models. Furthermore, to obtain signal sequences of optimal length, we devised a method that combines multiple alignment with analysis of the information content (or relative entropy).
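
The information content mentioned above can be illustrated with a minimal sketch. It assumes an ungapped alignment of DNA sites and a uniform background of 0.25 per base; the function names and the toy sites are hypothetical and are not taken from the paper, and no small-sample correction is applied.

```python
import math
from collections import Counter

def column_information(column, background=None):
    """Relative entropy (in bits) of one alignment column versus a background."""
    if background is None:
        # Assumption: uniform base composition.
        background = {base: 0.25 for base in "ACGT"}
    counts = Counter(column)
    total = len(column)
    info = 0.0
    for base, count in counts.items():
        p = count / total
        info += p * math.log2(p / background[base])
    return info

def information_profile(aligned_sites):
    """Information content for every column of equal-length aligned sites."""
    return [column_information(col) for col in zip(*aligned_sites)]

# Hypothetical toy alignment, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
profile = information_profile(sites)
```

A fully conserved column scores 2 bits under this background, and columns with mixed bases score less, so flanking columns with low scores are natural candidates for trimming when choosing a signal length.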

A new database system named KEGG is being organized to computerize functional aspects of genes and genomes in terms of the binary relations of interacting molecules or genes. We are currently working on a metabolic pathway database composed of three interconnected sections: genes, molecules, and pathways, which are also linked to a number of existing databases through our DBGET retrieval system. Here we present the basic concept of binary relations and hierarchical classifications used to represent the metabolic pathway data.
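
As a rough illustration of the binary-relation idea, the sketch below stores a pathway as (substrate, product) pairs and recovers a route by breadth-first search. The compound labels and function names are hypothetical placeholders; this is not the KEGG data model or the DBGET interface.

```python
from collections import defaultdict, deque

# Hypothetical binary relations: each pair means "left is converted to right".
relations = [
    ("compound_A", "compound_B"),
    ("compound_B", "compound_C"),
    ("compound_C", "compound_D"),
    ("compound_B", "compound_E"),
]

def build_graph(pairs):
    """Index the binary relations by their left-hand member."""
    graph = defaultdict(list)
    for source, target in pairs:
        graph[source].append(target)
    return graph

def find_path(graph, start, goal):
    """Breadth-first search for a shortest chain of relations from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

graph = build_graph(relations)
route = find_path(graph, "compound_A", "compound_D")
```

Storing only pairs keeps the data flat, and a pathway is reconstructed by chaining relations rather than being stored as a fixed diagram.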

A new sequence motif library, StrProf, was constructed to characterize the groups of related proteins in the PDB three-dimensional structure database. For a representative member of each protein family, which was identified by cross-referencing the PDB with the PIR superfamily classification, a group of related sequences was collected by a BLAST search against the nonredundant protein sequence database. For every group, the motifs were identified automatically according to the criteria of conservation and uniqueness of pentapeptide patterns and with a dual dynamic programming algorithm.

To investigate the molecular mechanisms that alter intron size, we conducted an extensive interspecies comparison of homologous introns among three mammalian groups: human, artiodactyls, and rodents. The size differences of introns were statistically significant among all three groups (introns were longest in human and shortest in rodents), and appear to be due to the accumulation of small deletions, according to separate counts of insertion and deletion frequencies. The distribution of intron size differences also has a shape similar to that of the distribution of insertion/deletion sizes found in pseudogenes.

An amino acid index is a set of 20 numerical values representing any of the different physicochemical and biochemical properties of amino acids. As a follow-up to the previous study, we have increased the size of the database, which currently contains 402 published indices, and repeated the single-linkage cluster analysis. The results largely confirmed the previous findings.
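
The clustering step can be sketched as follows. This is a minimal illustration, assuming a distance of 1 - |r| between two indices (r being the Pearson correlation), which the abstract does not specify. Cutting a single-linkage tree at a threshold is equivalent to taking connected components of the graph that links every pair closer than the threshold, so the sketch uses a union-find structure. The toy vectors have 5 values rather than 20 and are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def single_linkage(indices, threshold):
    """Single-linkage clusters obtained by cutting at a distance threshold."""
    names = list(indices)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distance = 1 - abs(pearson(indices[a], indices[b]))
            if distance <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical toy "indices" (real ones have 20 values, one per amino acid).
toy = {
    "hydro_1": [1, 2, 3, 4, 5],
    "hydro_2": [2, 4, 6, 8, 11],
    "size_1": [5, 1, 4, 2, 3],
}
clusters = single_linkage(toy, threshold=0.1)
```

With these toy values the two strongly correlated vectors fall into one cluster and the third remains on its own.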

A new modeling technique for arriving at the three-dimensional (3-D) structure of an RNA stem-loop has been developed, based on a conformational search by a genetic algorithm followed by refinement by energy minimization. The genetic algorithm simultaneously optimizes a population of conformations in the predefined conformational space and generates 3-D models of RNA. The fitness function to be optimized by the algorithm has been defined to reflect the satisfaction of known conformational constraints.

We have analyzed the distribution of guanine-cytosine (GC) content around the translation initiation site in genomic DNA sequences of different species. A set of sequences belonging to one species is aligned at the translation initiation site, and the average GC content is calculated for 100-base windows over a range of 500 bases in each of the upstream and downstream regions. Consistent with previous observations that coding regions are more GC-rich than non-coding regions, we observe a jump in the GC content at the translation initiation site, except for vertebrate sequences.
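
The windowed calculation can be sketched as below. The sketch assumes non-overlapping windows and that every sequence has been trimmed so that the initiation site sits at the same index; both are assumptions, since the abstract does not state them, and the function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(sequences, start_index, window=100, flank=500):
    """Average GC content per window for sequences aligned at start_index.

    Returns (offset, mean_gc) pairs; offset is the window start relative to
    the initiation site, negative upstream and non-negative downstream.
    """
    profile = []
    for offset in range(-flank, flank, window):
        values = []
        for seq in sequences:
            lo = start_index + offset
            hi = lo + window
            # Skip windows that fall outside a given sequence.
            if lo >= 0 and hi <= len(seq):
                values.append(gc_fraction(seq[lo:hi]))
        mean = sum(values) / len(values) if values else None
        profile.append((offset, mean))
    return profile

# Hypothetical toy sequence: AT-only upstream, GC-only downstream.
toy = "AT" * 250 + "GC" * 250
profile = windowed_gc([toy], start_index=500)
```

A jump such as the one reported would show up as a step between the last upstream window and the first downstream window of the profile.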

We have developed simulated annealing algorithms to solve the problem of multiple sequence alignment. The algorithm was shown to give the optimal solution as confirmed by the rigorous dynamic programming algorithm for three-sequence alignment. To overcome long execution times for simulated annealing, we utilized a parallel computer.

A procedure to detect similar local structures of proteins from C alpha coordinates is presented. First, the conformations of seven-residue peptide segments are approximated by a limited number of representatives, each of which is assigned a symbol. Thus, the overall conformation of a protein is represented by a symbol string.

To automate examination of massive amounts of sequence data for biological function, it is important to computerize interpretation based on empirical knowledge of sequence-function relationships. For this purpose, we have been constructing a knowledge base by organizing various experimental and computational observations as a collection of if-then rules. Here we report an expert system, which utilizes this knowledge base, for predicting localization sites of proteins only from the information on the amino acid sequence and the source origin.

An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. This procedure is applied to the PIR database, and a dictionary of sequence motifs that relate to specific superfamilies is constructed. The motifs have a practical relevance in identifying the membership of specific superfamilies without the need to perform sequence database searches in 20% of newly determined sequences.
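
The notion of a pattern that is conserved within a group and exclusive to it can be sketched in a simplified form. The sketch below assumes exact, fixed-length peptide words (k = 5 by default), which is a simplification and not necessarily the pattern definition used in the paper; the sequences are hypothetical toy strings.

```python
def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def exclusive_motifs(group, others, k=5):
    """Words present in every member of `group` and in no member of `others`.

    Assumes `group` is non-empty.
    """
    conserved = set.intersection(*(kmers(seq, k) for seq in group))
    background = set().union(*(kmers(seq, k) for seq in others))
    return conserved - background

# Hypothetical toy sequences, for illustration only.
family = ["MKGSGKTAL", "AAGSGKTWW"]
unrelated = ["MKTALPP"]
motifs = exclusive_motifs(family, unrelated)
```

A new sequence containing one of the returned words could then be assigned to the group by a dictionary lookup instead of a full database search.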

We have constructed a perceptron type neural network for E. coli promoter prediction and improved its ability to generalize with a new technique for selecting the sequence features shown during training. We have also reconstructed five previous prediction methods and compared the effectiveness of those methods and our neural network.
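
For orientation, a basic perceptron on one-hot encoded DNA can be sketched as follows. This is the classic perceptron learning rule only; it does not reproduce the feature-selection technique described in the abstract, and the training strings are hypothetical toy data, not real promoters.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat 0/1 vector, four entries per position."""
    return [1.0 if base == b else 0.0 for base in seq for b in BASES]

def predict(weights, bias, seq):
    """Return 1 if the weighted sum is positive, otherwise 0."""
    score = bias + sum(w * x for w, x in zip(weights, one_hot(seq)))
    return 1 if score > 0 else 0

def train_perceptron(examples, epochs=20, rate=1.0):
    """Classic perceptron rule: update weights only on misclassified examples."""
    length = len(examples[0][0])
    weights = [0.0] * (4 * length)
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for seq, label in examples:
            delta = label - predict(weights, bias, seq)
            if delta != 0:
                errors += 1
                for i, x in enumerate(one_hot(seq)):
                    weights[i] += rate * delta * x
                bias += rate * delta
        if errors == 0:
            break  # every training example is classified correctly
    return weights, bias

# Hypothetical toy data: 1 = promoter-like, 0 = not.
examples = [("TATAAT", 1), ("TATGAT", 1), ("TACAAT", 1),
            ("GCGCGC", 0), ("CCGGCC", 0), ("GGCCGG", 0)]
weights, bias = train_perceptron(examples)
```

All sequences must have the same length here; real promoter data would first need to be aligned or windowed to a fixed width.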

We have developed an expert system that makes use of various kinds of knowledge organized as "if-then" rules for predicting protein localization sites in Gram-negative bacteria, given the amino acid sequence information alone. We considered four localization sites: the cytoplasm, the inner (cytoplasmic) membrane, the periplasm, and the outer membrane. Most rules were derived from experimental observations.

From protein sequence comparison data found in the literature, a library was organized using peptide fragment sequences that are common to related proteins. Each of the fragments was then examined for its occurrence in all the protein superfamilies defined by the NBRF-PIR database. We have selected those fragment peptides that appear exclusively in one or a few superfamilies, and thus made a library of fragment peptides that characterize specific superfamilies.

To make better use of the information contained in rapidly expanding amino acid sequence data, a new method to predict various modification sites of proteins from their primary structures is presented. It is also applicable to the prediction of other functional sites in proteins. Here we show examples for N-glycosylation and serine/threonine phosphorylation sites.
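
The abstract does not spell out the prediction method, so the sketch below shows only a common baseline for comparison: scanning for the widely used N-glycosylation sequon N-X-S/T, where X is any residue except proline. The function name and the test peptide are hypothetical, and a sequon match is a candidate site, not a prediction from the method described above.

```python
import re

# Lookahead keeps matches zero-width after the N, so overlapping sequons are found.
SEQUON = re.compile(r"N(?=[^P][ST])")

def find_sequons(protein):
    """Return 0-based positions of asparagines that start an N-X-S/T sequon."""
    return [match.start() for match in SEQUON.finditer(protein.upper())]

# Hypothetical peptide: N at index 1 and 7 qualify; N at index 4 is followed by P.
positions = find_sequons("ANGSNPTNKT")
```

Such a scan is useful as a first filter, since it is cheap to run over a whole database before applying any statistical scoring.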

We have previously developed a general method based on the statistical technique of discriminant analysis to predict splice junctions in eukaryotic mRNA sequences [Nakata, K., Kanehisa, M. and DeLisi, C.

The relationship among 222 published indices representing various physicochemical and biochemical properties of amino acid residues has been investigated by hierarchical cluster analysis. The clustering result is illustrated by the minimum spanning tree, which is conveniently divided into four regions: alpha and turn propensities, beta propensity, hydrophobicity and other physicochemical properties including, among others, bulkiness of amino acid residues. In addition, several subclasses of hydrophobicity scales have been identified: preference of inside and outside, accessible surface area, surrounding hydrophobicity and other mostly experimental scales including transfer free energy, partition coefficients, HPLC parameters and polarity.

Using discriminant analysis, three types of protein secondary structure segments--helices, beta-strands and coils--are discriminated by amino acid sequence information alone. A variable in the discriminant analysis is defined by the amino acid index used to represent the sequence data and by the calculation method used to extract a feature in this representation. Thus, the three types of secondary structure segments derived from a set of non-homologous proteins from the Protein Data Bank are analyzed by 888 variables, which correspond to the mean, standard deviation, 3.

Multiple measures of similarity were employed to detect weak homologies among protein sequences (e.g., below 30% residue identity).

Based on our recent determinations of the nucleotide sequences of the L-aspartate ammonia-lyase genes from Escherichia coli and Pseudomonas fluorescens, primary structures of the two L-aspartate ammonia-lyases and fumarate hydratases from Bacillus subtilis and E. coli (N-terminal partial sequence) were compared by computer analysis. These four enzymes exhibited a significant homology of at least 37%, implying that L-aspartate ammonia-lyase and fumarate hydratase share a common evolutionary origin.

The GenBank nucleic acid sequence database is a computer-based collection of all published DNA and RNA sequences; it contains over five million bases in close to six thousand sequence entries drawn from four thousand five hundred published articles. Each sequence is accompanied by relevant biological annotation. The database is available either on magnetic tape, on floppy diskettes, on-line or in hardcopy form.
