Publications by authors named "Ukkonen E"

Motivation: Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs.
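The difference between the two model classes can be illustrated with a minimal sketch. It is not taken from the article: the toy length-2 motif, the probabilities, and the function names (`ppm_log_prob`, `adm_log_prob`, `marginal_ppm`) are all assumptions made for illustration. The PPM here is derived from the ADM by marginalizing, so both describe the same per-position base frequencies, yet only the ADM can prefer "AC" over "AT".

```python
import math

BASES = "ACGT"

def ppm_log_prob(seq, ppm):
    """Log-probability of seq under a PPM: each position is scored independently."""
    return sum(math.log(ppm[i][b]) for i, b in enumerate(seq))

def adm_log_prob(seq, first, trans):
    """Log-probability of seq under a first-order Markov model (ADM).

    first    -- distribution of the base at position 0
    trans[i] -- trans[i][a][b] = P(base b at position i+1 | base a at position i)
    """
    lp = math.log(first[seq[0]])
    for i in range(1, len(seq)):
        lp += math.log(trans[i - 1][seq[i - 1]][seq[i]])
    return lp

def marginal_ppm(first, trans):
    """Collapse an ADM into the PPM that has the same per-position frequencies."""
    cols = [dict(first)]
    for t in trans:
        prev = cols[-1]
        cols.append({b: sum(prev[a] * t[a][b] for a in BASES) for b in BASES})
    return cols

# Hypothetical toy motif: "A" tends to be followed by "C", and "G" by "T".
uniform = {b: 0.25 for b in BASES}
first = {"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1}
trans = [{
    "A": {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    "C": dict(uniform),
    "G": {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    "T": dict(uniform),
}]
ppm = marginal_ppm(first, trans)

for site in ("AC", "AT"):
    print(site,
          round(math.exp(ppm_log_prob(site, ppm)), 3),
          round(math.exp(adm_log_prob(site, first, trans)), 3))
```

Under these assumed numbers the PPM assigns "AC" and "AT" the same probability, because "C" and "T" are equally frequent at the second position, while the ADM separates them through the dependency on the preceding base.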


In some dimeric cases of transcription factor (TF) binding, the specificity of dimeric motifs has been observed to differ notably from what would be expected were the two factors to bind to DNA independently of each other. Current motif discovery methods are unable to learn monomeric and dimeric motifs in a modular fashion such that deviations from the expected motif would become explicit and the noise from dimeric occurrences would not corrupt monomeric models. We propose a novel modeling technique and an expectation maximization algorithm, implemented as the software tool MODER, for discovering monomeric TF binding motifs and their dimeric combinations.


Motivation: While the position weight matrix (PWM) is the most popular model for sequence motifs, there is growing evidence of the usefulness of more advanced models such as first-order Markov representations, and such models are also becoming available in well-known motif databases. Much research has addressed how to learn these models from training data, but the problem of predicting putative sites of the learned motifs by matching the model against new sequences has received less attention. Moreover, motif site analysis is often concerned with how different variants in the sequence affect the sites.


Motivation: New long-read sequencing technologies, like PacBio SMRT and Oxford Nanopore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilization of the reads in, e.g.


Previous studies have reported that chromosome synteny in Lepidoptera has been well conserved, yet the number of haploid chromosomes varies widely from 5 to 223. Here we report the genome (393 Mb) of the Glanville fritillary butterfly (Melitaea cinxia; Nymphalidae), a widely recognized model species in metapopulation biology and eco-evolutionary research, which has the putative ancestral karyotype of n=31. Using phylogenetic analyses of Nymphalidae and of other Lepidoptera, combined with orthologue-level comparisons of chromosomes, we conclude that the ancestral lepidopteran karyotype has been n=31 for at least 140 My.


The problem of finding the locations in DNA sequences that match a given motif describing the binding specificities of a transcription factor (TF) has many applications in computational biology. This problem has been extensively studied when the position weight matrix (PWM) model is used to represent motifs. We investigate it under the feature motif model, a generalization of the PWM model that does not assume independence between positions in the pattern while being compatible with the original PWM.


Although the proteins that read the gene regulatory code, transcription factors (TFs), have been largely identified, it is not well known which sequences TFs can recognize. We have analyzed the sequence-specific binding of human TFs using high-throughput SELEX and ChIP sequencing. A total of 830 binding profiles were obtained, describing 239 distinctly different binding specificities.


Motivation: Assembling genomes from short read data has become increasingly popular, but the problem remains computationally challenging especially for larger genomes. We study the scaffolding phase of sequence assembly where preassembled contigs are ordered based on mate pair data.

Results: We present MIP Scaffolder that divides the scaffolding problem into smaller subproblems and solves these with mixed integer programming.


Background: The discovery of surprisingly frequent patterns is of paramount interest in bioinformatics and computational biology. Among the patterns considered, those consisting of pairs of solid words that co-occur within a prescribed maximum distance, or gapped factors, emerge in a variety of contexts of DNA and protein sequence analysis. A few algorithms and tools have been developed in connection with specific formulations of the problem; however, none can comprehensively handle each of the multiple ways in which the distance between the two terms in a pair may be defined.
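As a rough illustration of what a gapped factor occurrence is, the sketch below lists co-occurrences of two solid words under one particular distance definition. That definition (the number of characters between the end of the first word and the start of the second) is an assumption chosen for the example, as is the function name `gapped_occurrences`; the abstract stresses that several definitions exist, and this naive enumeration is not the method of the paper.

```python
def gapped_occurrences(text, u, v, max_gap):
    """List (i, j) pairs where u starts at i, v starts at j, and the number of
    characters between the end of u and the start of v is in [0, max_gap].

    Assumed distance definition: gap = j - (i + len(u)).
    """
    def starts(word):
        # All start positions of an exact (solid) word in text.
        return [k for k in range(len(text) - len(word) + 1)
                if text.startswith(word, k)]

    v_starts = starts(v)
    hits = []
    for i in starts(u):
        end_u = i + len(u)
        hits.extend((i, j) for j in v_starts if 0 <= j - end_u <= max_gap)
    return hits

# Hypothetical toy sequence.
text = "ACGTTGACCGT"
print(gapped_occurrences(text, "AC", "GT", max_gap=1))
```

Counting the returned pairs for every candidate word pair, and comparing the counts with what a background model predicts, is one way the notion of a "surprisingly frequent" gapped factor could be made concrete.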


Position weight matrices are an important method for modeling signals or motifs in biological sequences, both in DNA and protein contexts. In this paper, we present fast algorithms for the problem of finding significant matches of such matrices. Our algorithms are of the online type, and they generalize classical multipattern matching, filtering, and superalphabet techniques of combinatorial string matching to the problem of weight matrix matching.
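For orientation, the matching problem itself can be sketched with the naive sliding-window scan that faster algorithms are usually compared against. This is only a baseline under stated assumptions, not the online algorithms of the paper: the log-odds scoring against a uniform background, the toy matrix, the threshold, and the names `log_odds_matrix` and `scan` are all hypothetical.

```python
import math

def log_odds_matrix(ppm, background):
    """Convert per-position probabilities into log2-odds scores against a background."""
    return [{b: math.log2(col[b] / background[b]) for b in col} for col in ppm]

def scan(seq, pwm, threshold):
    """Naive O(n * m) scan: score every window of length m, keep those >= threshold."""
    m = len(pwm)
    hits = []
    for start in range(len(seq) - m + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(m))
        if score >= threshold:
            hits.append((start, score))
    return hits

def column(preferred):
    """Toy column: the preferred base gets probability 0.7, the others 0.1 each."""
    return {b: (0.7 if b == preferred else 0.1) for b in "ACGT"}

# Hypothetical matrix favouring the site "ACG", scored against a uniform background.
background = {b: 0.25 for b in "ACGT"}
pwm = log_odds_matrix([column("A"), column("C"), column("G")], background)

for start, score in scan("TTACGTACGA", pwm, threshold=4.0):
    print(start, round(score, 2))
```

The scan touches every position of every window; filtering and multipattern techniques of the kind named in the abstract aim to avoid most of that work while reporting the same matches.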


Members of the large ETS family of transcription factors (TFs) have highly similar DNA-binding domains (DBDs), yet they have diverse functions and activities in physiology and oncogenesis. Some differences in DNA-binding preferences within this family have been described, but they have not been analysed systematically, and their contributions to targeting remain largely uncharacterized. We report here the DNA-binding profiles for all human and mouse ETS factors, which we generated using two different methods: a high-throughput microwell-based TF DNA-binding specificity assay, and protein-binding microarrays (PBMs).


The genetic code, the binding specificity of all transfer RNAs, defines how protein primary structure is determined by DNA sequence. DNA also dictates when and where proteins are expressed, and this information is encoded in a pattern of specific sequence motifs that are recognized by transcription factors. However, the DNA-binding specificity is only known for a small fraction of the approximately 1400 human transcription factors (TFs).


In the wake of numerous sequenced genomes becoming available, computational methods for the reconstruction of metabolic networks have received considerable attention. Here, we review recent methods and software tools useful along the reconstruction workflow, from sequence annotation and network assembly to model verification and testing against experimental data. Reconstruction methods can be divided into three categories, depending on the magnitude of network context which is taken into account in the process of assembling the metabolic model: First, each enzyme may be predicted independently by annotation transfer or machine learning methods.


Unlabelled: MOODS (MOtif Occurrence Detection Suite) is a software package for matching position weight matrices against DNA sequences. MOODS implements state-of-the-art online matching algorithms, achieving considerably faster scanning speed than with a simple brute-force search. MOODS is written in C++, with bindings for the popular BioPerl and Biopython toolkits.


Homozygosity for the G allele of rs6983267 at 8q24 increases colorectal cancer (CRC) risk approximately 1.5-fold. We report here that the risk allele G shows copy number increase during CRC development.


With genome analysis expanding from the study of genes to the study of gene regulation, 'regulatory genomics' utilizes sequence information, evolution and functional genomics measurements to unravel how regulatory information is encoded in the genome.


Background: Metabolic fluxes provide invaluable insight on the integrated response of a cell to environmental stimuli or genetic modifications. Current computational methods for estimating the metabolic fluxes from 13C isotopomer measurement data rely either on manual derivation of analytic equations constraining the fluxes or on the numerical solution of a highly nonlinear system of isotopomer balance equations. In the first approach, analytic equations have to be tediously derived for each organism, substrate or labelling pattern, while in the second approach, the global nature of an optimum solution is difficult to prove and comprehensive measurements of external fluxes to augment the 13C isotopomer data are typically needed.


Over recent years, five European PhD programmes have organized a series of 'Bioinformatics Research and Education Workshops'. These workshops address the needs of first-year PhD students and have been designed to combine a maximum of educational impact and scientific stimulation with a minimum of financial and administrative effort. We describe the BREW experience and argue that this type of event constitutes an attractive component of PhD education in computational biology and beyond.


ReMatch is a web-based, user-friendly tool that constructs stoichiometric network models for metabolic flux analysis, integrating user-developed models into a database collected from several comprehensive metabolic data resources, including KEGG, MetaCyc and ChEBI. In particular, ReMatch augments the metabolic reactions of the model with carbon mappings to facilitate 13C metabolic flux analysis. The construction of a network model consisting of biochemical reactions is the first step in most metabolic modelling tasks.


Melanoma is notorious for its high tendency to metastasize and its refractoriness to treatment thereafter. Metastasis is believed to occur mostly through the lymphatic system, and the status of sentinel lymph nodes is currently recognized as the best prognostic indicator. Unfortunately, the lymphatic metastatic process is still poorly understood and the occurrence of sentinel node metastases (micrometastases) may be underestimated.


This protocol describes the use of Enhancer Element Locator (EEL), a computer program that was designed to locate distal enhancer elements in long mammalian sequences. EEL will predict the location and structure of conserved enhancers after being provided with two orthologous DNA sequences and binding specificity matrices for the transcription factors (TFs) that are expected to contribute to the function of the enhancers to be identified. The freely available EEL software can analyze two 1-Mb sequences with 100 TF motifs in about 15 min on a modern Windows, Linux or Mac computer.


Malignant melanomas are characterized by their high propensity to invade and metastasize, but the molecular mechanisms of these traits have remained elusive. Our DNA microarray analyses of benign nevi and melanoma tissue specimens revealed that the genes encoding extracellular matrix proteins tenascin-C (TN-C), fibronectin (FN), and procollagen-I (PCOL-I) are highly upregulated in invasive and metastatic melanomas. The expression and distribution of these proteins were further studied by immunohistochemistry in benign nevi, radially and vertically growing melanomas, sentinel node micrometastases, and macrometastases.


Motivation: Flux estimation using isotopomer information of metabolites is currently the most reliable method to obtain quantitative estimates of the activity of metabolic pathways. However, the development of isotopomer measurement techniques for intermediate metabolites is a demanding task. Careful planning of isotopomer measurements is thus needed to maximize the available flux information while minimizing the experimental effort.


Understanding the regulation of human gene expression requires knowledge of the "second genetic code," which consists of the binding specificities of transcription factors (TFs) and the combinatorial code by which TF binding sites are assembled to form tissue-specific enhancer elements. Using a novel high-throughput method, we determined the DNA binding specificities of GLIs 1-3, Tcf4, and c-Ets1, which mediate transcriptional responses to the Hedgehog (Hh), Wnt, and Ras/MAPK signaling pathways. To identify mammalian enhancer elements regulated by these pathways on a genomic scale, we developed a computational tool, enhancer element locator (EEL).
