FSG: Fast String Graph Construction for De Novo Assembly.

J Comput Biol

Dipartimento di Informatica Sistemistica e Comunicazione, Università degli Studi di Milano-Bicocca, Milan, Italy .

Published: October 2017

The string graph for a collection of next-generation reads is a lossless data representation that is fundamental for de novo assemblers based on the overlap-layout-consensus paradigm. In this article, we explore a novel approach to compute the string graph, based on the FM-index and Burrows and Wheeler Transform. We describe a simple algorithm that uses only the FM-index representation of the collection of reads to construct the string graph, without accessing the input reads. Our algorithm has been integrated into the string graph assembler (SGA) as a standalone module to construct the string graph. The new integrated assembler has been assessed on a standard benchmark, showing that fast string graph (FSG) is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads. Moreover, we have studied the effect of coverage rates on the running times.

Download full-text PDF

Source
http://dx.doi.org/10.1089/cmb.2017.0089DOI Listing

Publication Analysis

Top Keywords

string graph
28
fast string
8
construct string
8
string
7
graph
7
fsg fast
4
graph construction
4
construction novo
4
novo assembly
4
assembly string
4

Similar Publications

Improving Molecular Design with Direct Inverse Analysis of QSAR/QSPR Model.

Mol Inform

January 2025

Department of Applied Chemistry, School of Science and Technology, Meiji University, 1-1-1 Higashi-Mita, Tama-ku, Kawasaki, Kanagawa 214-8571, Japan.

Recent advances in machine learning have significantly impacted molecular design, notably the molecular generation method combining the chemical variational autoencoder (VAE) with Gaussian mixture regression (GMR). In this method, a mathematical model is constructed with X as the latent variable of the molecule and Y as the target properties and activities. Through direct inverse analysis of this model, it is possible to generate molecules with the desired target properties.

View Article and Find Full Text PDF

Improving generalizability of drug-target binding prediction by pre-trained multi-view molecular representations.

Bioinformatics

December 2024

School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun, Jilin 130117, China.

Motivation: Most drugs start on their journey inside the body by binding the right target proteins. This is the reason that numerous efforts have been devoted to predicting the drug-target binding during drug development. However, the inherent diversity among molecular properties, coupled with limited training data availability, poses challenges to the accuracy and generalizability of these methods beyond their training domain.

View Article and Find Full Text PDF

The demand for innovative synthetic polymers with improved properties is high, but their structural complexity and vast design space hinder rapid discovery. Machine learning-guided molecular design is a promising approach to accelerate polymer discovery. However, the scarcity of labeled polymer data and the complex hierarchical structure of synthetic polymers make generative design particularly challenging.

View Article and Find Full Text PDF

In the field of chemical structure recognition, the task of converting molecular images into machine-readable data formats such as SMILES string stands as a significant challenge, primarily due to the varied drawing styles and conventions prevalent in chemical literature. To bridge this gap, we proposed MolNexTR, a novel image-to-graph deep learning model that collaborates to fuse the strengths of ConvNext, a powerful Convolutional Neural Network variant, and Vision-TRansformer. This integration facilitates a more detailed extraction of both local and global features from molecular images.

View Article and Find Full Text PDF

GIN-TONIC: non-hierarchical full-text indexing for graph genomes.

NAR Genom Bioinform

December 2024

Biomathematics and Statistics Scotland, The James Hutton Institute, Peter Guthrie Tait Road, EH9 3FD, Edinburgh, United Kingdom.

This paper presents a new data structure, GIN-TONIC (raph dexing hrough ptimal ear nterval ompaction), designed to index arbitrary string-labelled directed graphs representing, for instance, pangenomes or transcriptomes. GIN-TONIC provides several capabilities not offered by other graph-indexing methods based on the FM-Index. It is non-hierarchical, handling a graph as a monolithic object; it indexes at nucleotide resolution all possible walks in the graph without the need to explicitly store them; it supports exact substring queries in polynomial time and space for all possible walk roots in the graph, even if there are exponentially many walks corresponding to such roots.

View Article and Find Full Text PDF

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!