IEEE/ACM Trans Comput Biol Bioinform
August 2019
The Burrows-Wheeler transform (BWT) of short-read data has unexplored potential uses, such as efficient and sensitive variation analysis against multiple reference genome sequences, because, unlike conventional mapping-based methods, it does not depend on any particular reference genome sequence. However, since the amount of read data is generally much larger than the size of the reference sequence, computing the BWT of reads is not easy, and this hampers the development of potential applications. To alleviate this problem, a new method of computing the BWT of reads in parallel is proposed.
BMC Bioinformatics
July 2016
Background: The potential utility of the Burrows-Wheeler transform (BWT) of a large amount of short-read data ("reads") has not been fully studied. The BWT basically serves as a lossless dictionary of reads, unlike the heuristic and lossy reads-to-genome mapping results conventionally obtained in the first step of sequence analysis. Thus, it is naturally expected to lead to development of sensitive methods for analysis of short-read data.
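To illustrate what "a lossless dictionary of reads" can mean in practice, the following is a minimal Python sketch, not the method of either paper above. It builds the BWT of a small read collection by naively sorting every suffix of every read, then counts how often a substring occurs across all reads by backward search. The function names (bwt_of_reads, count_occurrences), the "$" terminator, and the tie-breaking by read index are assumptions made for this example; the naive sort is only practical for toy inputs.

```python
from collections import Counter


def bwt_of_reads(reads):
    """Naive BWT of a read collection (illustrative, quadratic in read length).

    Each read gets a "$" terminator. Suffixes are sorted lexicographically,
    with ties between identical suffixes broken by read index.
    """
    entries = []
    for read_index, read in enumerate(reads):
        text = read + "$"
        for j in range(len(text)):
            # text[j - 1] is the character preceding the suffix;
            # for j == 0 it wraps around to the terminator "$".
            entries.append((text[j:], read_index, text[j - 1]))
    entries.sort(key=lambda e: (e[0], e[1]))
    return "".join(prev for _, _, prev in entries)


def count_occurrences(bwt, pattern):
    """Count occurrences of pattern in the reads using backward search."""
    counts = Counter(bwt)
    # first_row_start[c] = number of characters in the BWT smaller than c
    first_row_start, total = {}, 0
    for c in sorted(counts):
        first_row_start[c] = total
        total += counts[c]

    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in first_row_start:
            return 0
        # bwt.count(c, 0, i) plays the role of the rank function occ(c, i);
        # a real index would answer it in constant time.
        lo = first_row_start[c] + bwt.count(c, 0, lo)
        hi = first_row_start[c] + bwt.count(c, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    bwt = bwt_of_reads(["ACG", "ACT"])
    print(bwt)
    print(count_occurrences(bwt, "AC"))
```

Because every suffix of every read is represented, any substring query is answered exactly, which is the contrast with heuristic read-to-genome mapping that the abstract draws.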
Etiology of narcolepsy-cataplexy involves multiple genetic and environmental factors. While the human leukocyte antigen (HLA)-DRB1*15:01-DQB1*06:02 haplotype is strongly associated with narcolepsy, it is not sufficient for disease development. To identify additional, non-HLA susceptibility genes, we conducted a genome-wide association study (GWAS) using Japanese samples.
Motivation: Sequence-variation analysis is conventionally performed on mapping results that are highly redundant and occasionally contain undesirable heuristic biases. A straightforward approach to single-nucleotide polymorphism (SNP) analysis, using the Burrows-Wheeler transform (BWT) of short-read data, is proposed.
Results: The BWT makes it possible to simultaneously process collections of read fragments of the same sequences; accordingly, SNPs were found from the BWT much faster than from the mapping results.
Elucidation of the genetic susceptibility factors for diabetic retinopathy (DR) is important to gain insight into the pathogenesis of DR, and may help to define genetic risk factors for this condition. In the present study, we conducted a three-stage genome-wide association study (GWAS) to identify DR susceptibility loci in Japanese patients, which comprised a total of 837 type 2 diabetes patients with DR (cases) and 1,149 without DR (controls). From the stage 1 genome-wide scan of 446 subjects (205 cases and 241 controls) on 614,216 SNPs, 249 SNPs were selected for the stage 2 replication in 623 subjects (335 cases and 288 controls).
In humans, narcolepsy with cataplexy (narcolepsy) is a sleep disorder that is characterized by sleepiness, cataplexy and rapid eye movement (REM) sleep abnormalities. Narcolepsy is caused by a reduction in the number of neurons that produce hypocretin (orexin) neuropeptide. Both genetic and environmental factors contribute to the development of narcolepsy.
To discover susceptibility genes of late-onset Alzheimer's disease (LOAD), we conducted a 3-stage genome-wide association study (GWAS) using three populations: Japanese from the Japanese Genetic Consortium for Alzheimer Disease (JGSCAD), Koreans, and Caucasians from the Alzheimer Disease Genetic Consortium (ADGC). In Stage 1, we evaluated data for 5,877,918 genotyped and imputed SNPs in Japanese cases (n = 1,008) and controls (n = 1,016). Genome-wide significance was observed with 12 SNPs in the APOE region.
J Bioinform Comput Biol
August 2012
Myers' elegant and powerful bit-parallel dynamic programming algorithm for approximate string matching has a restriction that the query length should be within the word size of the computer, typically 64. We propose a modification of Myers' algorithm, in which the modification has a restriction not on the query length but on the maximum number of mismatches (substitutions, insertions, or deletions), which should be less than half of the word size. The time complexity is O(m log |Σ|), where m is the query length and |Σ| is the size of the alphabet Σ.
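For orientation, here is a minimal Python sketch of the original Myers bit-parallel search that the abstract takes as its starting point; it is not the proposed modification. It reports every text position where the pattern matches with at most k differences. Python integers are unbounded, so the sketch emulates an m-bit machine word with an explicit mask; in a fixed-width language the pattern would have to fit in one word, which is exactly the restriction the abstract mentions. The function name myers_search and the (end position, score) output format are assumptions made for this example.

```python
def myers_search(pattern, text, k):
    """Approximate matching with Myers' bit-parallel algorithm.

    Returns a list of (end_index, edit_distance) for every text position
    where some substring ending there is within k edits of the pattern.
    """
    m = len(pattern)
    if m == 0:
        return []
    mask = (1 << m) - 1   # emulates an m-bit machine word
    high = 1 << (m - 1)   # bit for the last pattern row

    # peq[c] has bit i set when pattern[i] == c
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    pv, mv = mask, 0      # vertical +1 / -1 deltas of the DP column
    score = m             # DP value in the last row of the current column
    hits = []
    for j, ch in enumerate(text):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift without setting the low bit: a match may start anywhere.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
        if score <= k:
            hits.append((j, score))
    return hits


if __name__ == "__main__":
    print(myers_search("abc", "xxabcxx", 1))
```

Each text character costs a constant number of word operations, independent of the pattern length, as long as the pattern fits in one word.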
Hepatitis B virus (HBV) infection can lead to serious liver diseases, including liver cirrhosis (LC) and hepatocellular carcinoma (HCC); however, about 85-90% of infected individuals become inactive carriers with sustained biochemical remission and very low risk of LC or HCC. To identify host genetic factors contributing to HBV clearance, we conducted genome-wide association studies (GWAS) and replication analysis using samples from HBV carriers and spontaneously HBV-resolved Japanese and Korean individuals. Association analysis in the Japanese and Korean data identified the HLA-DPA1 and HLA-DPB1 genes with P(meta) = 1.
Background: Multiple genetic factors and their interactive effects are speculated to contribute to complex diseases. Detecting such genetic interactive effects, i.e.
Family and twin studies have indicated that genetic factors have an important role in panic disorder (PD), whereas its pathogenesis has remained elusive. We conducted a genome-wide copy number variation (CNV) association study to elucidate the involvement of structural variants in the etiology of PD. The participants were 2055 genetically unrelated Japanese people (535 PD cases and 1520 controls).
Hematologic abnormalities during current therapy with pegylated interferon and ribavirin (PEG-IFN/RBV) for chronic hepatitis C (CHC) often necessitate dose reduction and premature withdrawal from therapy. The aim of this study was to identify host factors associated with IFN-induced thrombocytopenia by genome-wide association study (GWAS). In the GWAS stage using 900K single-nucleotide polymorphism (SNP) microarrays, 303 Japanese CHC patients treated with PEG-IFN/RBV therapy were genotyped.
Background: Array-based detection of copy number variations (CNVs) is widely used for identifying disease-specific genetic variations. However, the accuracy of CNV detection is not sufficient and results differ depending on the detection programs used and their parameters. In this study, we evaluated five widely used CNV detection programs, Birdsuite (mainly consisting of the Birdseye and Canary modules), Birdseye (part of Birdsuite), PennCNV, CGHseg, and DNAcopy from the viewpoint of performance on the Affymetrix platform using HapMap data and other experimental data.
An amyotrophic lateral sclerosis (ALS) mutation database has been constructed as a publicly accessible online resource for recording the nucleotide and amino acid variants identified in genes associated with ALS, along with corresponding clinical conditions. The database currently consists of more than 600 entries, including about 180 unique variants found in 25 disease-causative or disease-related genes. In addition to published data collected from literature, novel variants identified by microarray resequencing in our laboratory are incorporated into the database.
Objective: As more full-text biomedical papers are becoming available in digitized form online, there is a need for tools to mine information from all parts of such papers. Because the figures and legends/captions in biomedical papers provide important information about research outcomes, mining techniques targeting them have attracted a great deal of attention. In this study, we focused on pathway figures that illustrate signaling or metabolic pathways, because many of these are important in understanding disease mechanism(s).
We introduce a new data structure, a localized suffix array, based on which occurrence information is dynamically represented as the combination of global positional information and local lexicographic order information in text search applications. For the search of a pair of words within a given distance, many candidate positions that share a coarse-grained global position can be compactly represented in terms of local lexicographic orders as in the conventional suffix array, and they can be simultaneously examined for violation of the distance constraint at the coarse-grained resolution. The trade-off between the positional and lexicographical information is progressively shifted towards finer positional resolution, and the distance constraint is reexamined accordingly.
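As a baseline for comparison, the word-pair search can be sketched with a conventional suffix array; this is not the localized suffix array itself, which the abstract describes only at a high level. The sketch finds all occurrences of each word by binary search over the suffix array, then checks the distance constraint position by position, which is the per-candidate work the localized structure is meant to reduce. The function names (build_suffix_array, occurrences, pairs_within) and the definition of distance as the absolute difference of start positions are assumptions made for this example.

```python
import bisect


def build_suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, sa, word):
    """Sorted start positions of word in text, via binary search on sa."""
    n = len(word)

    def prefix(rank):
        return text[sa[rank]:sa[rank] + n]

    # Lower bound: first suffix whose n-character prefix is >= word
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < word:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose n-character prefix is > word
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def pairs_within(text, sa, word1, word2, distance):
    """All (pos1, pos2) with |pos1 - pos2| <= distance."""
    second = occurrences(text, sa, word2)
    pairs = []
    for p in occurrences(text, sa, word1):
        i = bisect.bisect_left(second, p - distance)
        while i < len(second) and second[i] <= p + distance:
            pairs.append((p, second[i]))
            i += 1
    return pairs


if __name__ == "__main__":
    text = "the cat sat on the mat"
    sa = build_suffix_array(text)
    print(pairs_within(text, sa, "cat", "sat", 5))
```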
We have developed efficient in-practice algorithms for computing rank and select functions on a binary string, based on a novel data structure, a hierarchical binary string with hierarchical accumulatives. It efficiently stores decomposed information on partial summations over various scales of subregions of a given binary string, so that the required space overhead ratio is only about 3.5% irrespective of the string length.
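For readers unfamiliar with the two operations, the following minimal Python sketch shows rank and select on a binary string using a single level of precomputed block counts. It is not the hierarchical structure described above and makes no claim about its space overhead; the class name RankSelect, the block size of 64, and the string-of-characters input format are assumptions made for this example.

```python
import bisect


class RankSelect:
    """Rank/select on a '0'/'1' string with one level of block counts."""

    BLOCK = 64

    def __init__(self, bits):
        self.bits = bits
        # cum[b] = number of 1s in bits[0 : b * BLOCK]
        self.cum = [0]
        for start in range(0, len(bits), self.BLOCK):
            ones = bits.count("1", start, start + self.BLOCK)
            self.cum.append(self.cum[-1] + ones)

    def rank1(self, i):
        """Number of 1s in bits[0:i]."""
        block = i // self.BLOCK
        # Stored partial sum plus a scan of at most one block
        return self.cum[block] + self.bits.count("1", block * self.BLOCK, i)

    def select1(self, k):
        """Index of the k-th 1 (k counts from 1), or -1 if there is none."""
        if k < 1 or k > self.cum[-1]:
            return -1
        # Find the block b with cum[b] < k <= cum[b + 1]
        block = bisect.bisect_left(self.cum, k) - 1
        remaining = k - self.cum[block]
        for pos in range(block * self.BLOCK, len(self.bits)):
            if self.bits[pos] == "1":
                remaining -= 1
                if remaining == 0:
                    return pos
        return -1


if __name__ == "__main__":
    rs = RankSelect("1011001" * 20)
    print(rs.rank1(70), rs.select1(5))
```

The two operations are inverses in the sense that rank1(select1(k)) equals k - 1 for every valid k.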
The recommended treatment for patients with chronic hepatitis C, pegylated interferon-alpha (PEG-IFN-alpha) plus ribavirin (RBV), does not provide sustained virologic response (SVR) in all patients. We report a genome-wide association study (GWAS) to null virological response (NVR) in the treatment of patients with hepatitis C virus (HCV) genotype 1 within a Japanese population. We found two SNPs near the gene IL28B on chromosome 19 to be strongly associated with NVR (rs12980275, P = 1.
The establishment of high-throughput single-nucleotide polymorphism (SNP)-typing technologies has enabled astonishing progress to be made in genome-wide association studies (GWAS), and various novel genetic factors associated with complex diseases have been discovered. Our organization has created a public repository database (DB) to achieve a continuous and intensive management of GWAS data and to facilitate data sharing among researchers. In the GWAS DB, information on study design, quality control protocols, allele frequencies, genotype frequencies and statistical genetic analysis results are stored as publicly available data and can be accessed freely, whereas individual genotyping data and raw data are stored as restricted data and can only be accessed with authorization.
Background: With improvements in genotyping technologies, genome-wide association studies with hundreds of thousands of SNPs allow the identification of candidate genetic loci for multifactorial diseases in different populations. However, genotyping errors caused by genotyping platforms or genotype calling algorithms may lead to inflation of false associations between markers and phenotypes. In addition, the number of SNPs available for genome-wide association studies in the Japanese population has been investigated using only 45 samples in the HapMap project, which could lead to an inaccurate estimation of the number of SNPs with low minor allele frequencies.
Genome-wide association studies (GWAS) using a large number of single nucleotide polymorphisms (SNPs) have successfully been applied to identify genetic variants of common diseases. However, genotyping using the new array technologies is often associated with spurious results that could unfavorably affect analyses of GWAS. Consequently, data cleaning is of paramount importance in excluding spurious genotyping results.
Predicting the interactions between all the possible pairs of proteins in a given organism (making a protein-protein interaction map) is a crucial subject in bioinformatics. Most of the previous methods based on supervised machine learning use datasets containing approximately the same number of interacting pairs of proteins (positives) and non-interacting pairs of proteins (negatives) for training a classifier and are estimated to yield a large number of false positives. Thinking that the negatives used in previous studies cannot adequately represent all the negatives that need to be taken into account, we have developed a method based on multiple Support Vector Machines (SVMs) that uses more negatives than positives for predicting interactions between pairs of yeast proteins and pairs of human proteins.