BMC Bioinformatics
November 2024
Background: The maximal sensitivity for local pairwise alignment makes the Smith-Waterman algorithm a popular choice for protein sequence database search. However, its quadratic time complexity makes it compute-intensive. Unfortunately, current state-of-the-art software tools are not able to leverage the massively parallel processing capabilities of modern GPUs with close-to-peak performance.
View Article and Find Full Text PDFStud Health Technol Inform
August 2024
Introduction: Retrieving comprehensible rule-based knowledge from medical data by machine learning is a beneficial task, e.g., for automating the process of creating a decision support system.
View Article and Find Full Text PDFBackground: Commonly used next generation sequencing machines typically produce large amounts of short reads of a few hundred base-pairs in length. However, many downstream applications would generally benefit from longer reads.
Results: We present CAREx-an algorithm for the generation of pseudo-long reads from paired-end short-read Illumina data based on the concept of repeatedly computing multiple-sequence-alignments to extend a read until its partner is found.
Drug Discov Today
June 2024
The enormous growth in the amount of data generated by the life sciences is continuously shifting the field from model-driven science towards data-driven science. The need for efficient processing has led to the adoption of massively parallel accelerators such as graphics processing units (GPUs). Consequently, the development of bioinformatics methods nowadays often heavily depends on the effective use of these powerful technologies.
View Article and Find Full Text PDFSummary: We propose RabbitKSSD, a high-speed genome distance estimation tool. Specifically, we leverage load-balanced task partitioning, fast I/O, efficient intermediate result accesses, and high-performance data structures to improve overall efficiency. Our performance evaluation demonstrates that RabbitKSSD achieves speedups ranging from 5.
View Article and Find Full Text PDFDeep learning has emerged as a paradigm that revolutionizes numerous domains of scientific research. Transformers have been utilized in language modeling outperforming previous approaches. Therefore, the utilization of deep learning as a tool for analyzing the genomic sequences is promising, yielding convincing results in fields such as motif identification and variant calling.
View Article and Find Full Text PDFAssessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis and error correction. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems.
View Article and Find Full Text PDFWe present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.
View Article and Find Full Text PDFIEEE/ACM Trans Comput Biol Bioinform
June 2023
The continuous growth of generated sequencing data leads to the development of a variety of associated bioinformatics tools. However, many of them are not able to fully exploit the resources of modern multi-core systems since they are bottlenecked by parsing files leading to slow execution times. This motivates the design of an efficient method for parsing sequencing data that can exploit the power of modern hardware, especially for modern CPUs with fast storage devices.
View Article and Find Full Text PDFBackground: Mass spectrometry is an important experimental technique in the field of proteomics. However, analysis of certain mass spectrometry data faces a combination of two challenges: first, even a single experiment produces a large amount of multi-dimensional raw data and, second, signals of interest are not single peaks but patterns of peaks that span along the different dimensions. The rapidly growing amount of mass spectrometry data increases the demand for scalable solutions.
View Article and Find Full Text PDFBackground: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections.
View Article and Find Full Text PDFMotivation: Detection and identification of viruses and microorganisms in sequencing data plays an important role in pathogen diagnosis and research. However, existing tools for this problem often suffer from high runtimes and memory consumption.
Results: We present RabbitV, a tool for rapid detection of viruses and microorganisms in Illumina sequencing datasets based on fast identification of unique k-mers.
Next-generation sequencing (NGS) methods lie at the heart of large parts of biological and medical research. Their fundamental importance has created a continuously increasing demand for processing and analysis methods of the data sets produced, addressing questions such as variant calling, metagenomic classification and quantification, genomic feature detection, or downstream analysis in larger biological or medical contexts. In addition to classical algorithmic approaches, machine-learning (ML) techniques are often used for such tasks.
View Article and Find Full Text PDFThe progress of next-generation sequencing has lead to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark, have been proposed.
View Article and Find Full Text PDFObjectives: To develop and evaluate a deep learning algorithm for fully automated detection of primary sclerosing cholangitis (PSC)-compatible cholangiographic changes on three-dimensional magnetic resonance cholangiopancreatography (3D-MRCP) images.
Methods: The datasets of 428 patients (n = 205 with confirmed diagnosis of PSC; n = 223 non-PSC patients) referred for MRI including MRCP were included in this retrospective IRB-approved study. Datasets were randomly assigned to a training (n = 386) and a validation group (n = 42).
Purpose: To create a fully automated, reliable, and fast segmentation tool for Gd-EOB-DTPA-enhanced MRI scans using deep learning.
Materials And Methods: Datasets of Gd-EOB-DTPA-enhanced liver MR images of 100 patients were assembled. Ground truth segmentation of the hepatobiliary phase images was performed manually.
Motivation: Mash is a popular hash-based genome analysis toolkit with applications to important downstream analyses tasks such as clustering and assembly. However, Mash is currently not able to fully exploit the capabilities of modern multi-core architectures, which in turn leads to high runtimes for large-scale genomic datasets.
Results: We present RabbitMash, an efficient highly optimized implementation of Mash which can take full advantage of modern hardware including multi-threading, vectorization and fast I/O.
Motivation: Error correction is a fundamental pre-processing step in many Next-Generation Sequencing (NGS) pipelines, in particular for de novo genome assembly. However, existing error correction methods either suffer from high false-positive rates since they break reads into independent k-mers or do not scale efficiently to large amounts of sequencing reads and complex genomes.
Results: We present CARE-an alignment-based scalable error correction algorithm for Illumina data using the concept of minhashing.
Motivation: Modern sequencing technologies continue to revolutionize many areas of biology and medicine. Since the generated datasets are error-prone, downstream applications usually require quality control methods to pre-process FASTQ files. However, existing tools for this task are currently not able to fully exploit the capabilities of computing platforms leading to slow runtimes.
View Article and Find Full Text PDFBackground: Obtaining data from single-cell transcriptomic sequencing allows for the investigation of cell-specific gene expression patterns, which could not be addressed a few years ago. With the advancement of droplet-based protocols the number of studied cells continues to increase rapidly. This establishes the need for software tools for efficient processing of the produced large-scale datasets.
View Article and Find Full Text PDFBackground: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches.
View Article and Find Full Text PDFModification mapping from cDNA data has become a tremendously important approach in epitranscriptomics. So-called reverse transcription signatures in cDNA contain information on the position and nature of their causative RNA modifications. Data mining of, e.
View Article and Find Full Text PDFJ Magn Reson Imaging
February 2020
Background: Chronic obstructive pulmonary disease (COPD) is associated with high morbidity and mortality. Identification of imaging biomarkers for phenotyping is necessary for future treatment and therapy monitoring. However, translation of visual analytic pipelines into clinics or their use in large-scale studies is significantly slowed by time-consuming postprocessing steps.
View Article and Find Full Text PDFBioinformatics
December 2019
Motivation: K-mers along with their frequency have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters itself is large; very often, it is too large to fit into main memory, leading to highly narrowed usability.
View Article and Find Full Text PDFMotivation: Modern bioinformatics tools for analyzing large-scale NGS datasets often need to include fast implementations of core sequence alignment algorithms in order to achieve reasonable execution times. We address this need by presenting the BGSA toolkit for optimized implementations of popular bit-parallel global pairwise alignment algorithms on modern microprocessors.
Results: BGSA outperforms Edlib, SeqAn and BitPAl for pairwise edit distance computations and Parasail, SeqAn and BitPAl when using more general scoring schemes for pairwise alignments of a batch of sequence reads on both standard multi-core CPUs and Xeon Phi many-core CPUs.