Fast and efficient short read mapping based on a succinct hash index.

Haowen Zhang, Yuandong Chan, Kaichao Fan, Bertil Schmidt, Weiguo Liu

BMC Bioinformatics

School of Software, Shandong University, Shunhua Road 1500, Jinan, Shandong, China.

Published: March 2018

Background: Various indexing techniques have been applied by next-generation sequencing read mapping tools. The choice of a particular data structure is a trade-off between memory consumption, mapping throughput, and construction time.

Results: We present the succinct hash index, a novel data structure for read mapping. It is a variant of the classical q-gram index with a particularly small memory footprint, occupying between 3.5 and 5.3 GB for a human reference genome under typical parameter settings. The succinct hash index features two novel seed selection algorithms (group seeding and variable-length seeding) and an efficient parallel construction algorithm, which we have implemented in FEM (Fast and Efficient read Mapper). FEM can return all read mappings within a given edit distance. Our experimental results show that FEM is scalable and outperforms other state-of-the-art all-mappers in both speed and memory footprint. Compared to Masai, FEM is an order of magnitude faster using a single thread and two orders of magnitude faster using multiple threads. Furthermore, we observe up to a 2.8-fold speedup compared to BitMapper and an order-of-magnitude speedup compared to BitMapper2 and Hobbes3.

Conclusions: The presented succinct index is the first feasible implementation of the q-gram index functionality that occupies around 3.5 GB of memory for a whole human reference genome. FEM is freely available at https://github.com/haowenz/FEM.
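
For readers unfamiliar with the classical q-gram index that the abstract takes as its starting point, the following is a minimal Python sketch of a plain q-gram index combined with pigeonhole seeding and edit-distance verification. It is not FEM's implementation: the succinct memory layout, group seeding, variable-length seeding, and parallel construction are not reproduced here. The function names (`build_qgram_index`, `min_edit_distance_in_window`, `map_read`) and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_qgram_index(reference: str, q: int) -> dict[str, list[int]]:
    """Classical q-gram index: each q-gram maps to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - q + 1):
        index[reference[pos:pos + q]].append(pos)
    return dict(index)


def min_edit_distance_in_window(read: str, window: str) -> int:
    """Smallest edit distance between the read and any substring of the window.

    Standard dynamic programming with a free start and free end in the window.
    """
    prev = [0] * (len(window) + 1)  # an empty read matches anywhere at cost 0
    for i, r in enumerate(read, start=1):
        cur = [i] + [0] * len(window)
        for j, w in enumerate(window, start=1):
            cur[j] = min(prev[j - 1] + (r != w),  # match or substitution
                         prev[j] + 1,             # read base not in window
                         cur[j - 1] + 1)          # window base not in read
        prev = cur
    return min(prev)


def map_read(read: str, reference: str, index: dict[str, list[int]],
             q: int, max_errors: int) -> dict[int, int]:
    """Return {approximate start position: edit distance} for verified hits.

    Pigeonhole seeding (an assumption of this sketch, not FEM's seed
    selection): the read is cut into max_errors + 1 disjoint parts, so at
    least one part is error-free in any mapping with <= max_errors edits.
    Hits that differ only by an indel shift may be reported more than once.
    """
    n_parts = max_errors + 1
    part_len = len(read) // n_parts
    if part_len < q:
        raise ValueError("read is too short for this q and error threshold")
    hits: dict[int, int] = {}
    for part in range(n_parts):
        offset = part * part_len
        seed = read[offset:offset + q]
        for pos in index.get(seed, []):
            start = pos - offset  # implied read start on the reference
            lo = max(0, start - max_errors)
            hi = start + len(read) + max_errors
            dist = min_edit_distance_in_window(read, reference[lo:hi])
            if dist <= max_errors:
                hits[start] = min(dist, hits.get(start, dist))
    return hits
```

With the toy reference `"TTGACCAGTACGGATCCATGCAAGT"`, `q = 4`, and one allowed error, the read `"CATTACGGATCC"` (a copy of positions 5 to 16 with one substitution) is recovered at position 5 with distance 1. The directory of such an index grows with the number of distinct q-grams and the position table with the reference length, which is why the memory footprint reported in the abstract is a central design concern.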

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5845352
DOI: http://dx.doi.org/10.1186/s12859-018-2094-5

Publication Analysis

Top Keywords

read mapping

succinct hash

data structure

memory footprint

human reference

reference genome

speedup compared

read

fem

fast efficient

Similar Publications

Advancing long-read nanopore genome assembly and accurate variant calling for rare disease detection.

Am J Hum Genet

January 2025

UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA.

Shloka Negi, Sarah L Stenton, Seth I Berger, Paolo Canigiula, Brandy McNulty

More than 50% of families with suspected rare monogenic diseases remain unsolved after whole-genome analysis by short-read sequencing (SRS). Long-read sequencing (LRS) could help bridge this diagnostic gap by capturing variants inaccessible to SRS, facilitating long-range mapping and phasing and providing haplotype-resolved methylation profiling. To evaluate LRS's additional diagnostic yield, we sequenced a rare-disease cohort of 98 samples from 41 families, using nanopore sequencing, achieving per sample ∼36× average coverage and 32-kb read N50 from a single flow cell.

MeStanG-Resource for High-Throughput Sequencing Standard Data Sets Generation for Bioinformatic Methods Evaluation and Validation.

Biology (Basel)

January 2025

Institute for Biosecurity and Microbial Forensics (IBMF), Oklahoma State University, Stillwater, OK 74078, USA.

Daniel Ramos Lopez, Francisco J Flores, Andres S Espindola

Metagenomics analysis has enabled the measurement of the microbiome diversity in environmental samples without prior targeted enrichment. Functional and phylogenetic studies based on microbial diversity retrieved using HTS platforms have advanced from detecting known organisms and discovering unknown species to applications in disease diagnostics. Robust validation processes are essential for test reliability, requiring standard samples and databases deriving from real samples and in silico generated artificial controls.

Assessing spatially explicit long-term landscape dynamics based on automated production of land category layers from Danish late nineteenth-century topographic maps in comparison with contemporary maps.

Environ Monit Assess

January 2025

Royal Danish Library, Special Collections, Søren Kierkegaards Plads. 1, 1221, Copenhagen K, Denmark.

Gregor Levin, Geoff Groom, Stig Roar Svenningsen

Historical topographical maps contain valuable, spatially and thematically detailed information about past landscapes. Yet, for analyses of landscape dynamics through geographical information systems, it is necessary to "unlock" this information via map processing. For two study areas in northern and central Jutland, Denmark, we apply object-based image analysis, vector GIS, colour image segmentation, and machine learning processes to produce machine-readable layers for the land use and land cover categories forest, wetland, heath, dune sand, and water bodies from topographic maps from the late nineteenth century.

Distinct Behavioural and Brain Response Profiles Between Arithmetic Word Problem Solving and Sentence Comprehension in Third and Fourth Graders.

Eur J Neurosci

January 2025

Department of Psychology, National Chengchi University, Taipei, Taiwan.

Chan-Tat Ng, Xin-Yu Chen, Ting-Ting Chang

Word problems are essential for math learning and education, bridging numerical knowledge with real-world applications. Despite their importance, the neural mechanisms underlying word problem solving, especially in children, remain poorly understood. Here, we examine children's cognitive and brain response profiles for arithmetic word problems (AWPs), which involve one-step mathematical operations, and compare them with nonarithmetic word problems (NWPs), structured as parallel narratives without numerical operations.

LipBengal: Pioneering Bengali lip-reading dataset for pronunciation mapping through lip gestures.

Data Brief

February 2025

Department of Electrical, Electronic and Communication Engineering, Military Institute of Science and Technology (MIST), Dhaka 1216, Bangladesh.

Md Tanvir Rahman Sahed, Md Tanjil Islam Aronno, Hussain Nyeem, Md Abdul Wahed, Tashrif Ahsan

The dataset represents a significant advancement in Bengali lip-reading and visual speech recognition research, poised to drive future applications and technological progress. Despite Bengali's global status as the seventh most spoken language with approximately 265 million speakers, linguistically rich and widely spoken languages like Bengali have been largely overlooked by the research community. LipBengal fills this gap by offering a pioneering dataset tailored for Bengali lip-reading, comprising visual data from 150 speakers across 54 classes, encompassing Bengali phonemes, alphabets, and symbols.
