Publications by authors named "Mingfu Shao"

Modern sequencing technologies allow for the addition of short-sequence tags, known as anchors, to both ends of a captured molecule. Anchors are useful in assembling the full-length sequence of a captured molecule as they can be used to accurately determine the endpoints. One representative of such anchor-enabled technologies is LoopSeq Solo, a synthetic long read (SLR) sequencing protocol.

A tandem repeat is a sequence of nucleotides that occurs as multiple contiguous and near-identical copies positioned next to each other. These repeats play critical roles in genetic diversity and gene regulation, and are strongly linked to various neurological and developmental disorders. While several methods exist for detecting tandem repeats, they often exhibit low accuracy when the repeat unit length increases or the number of copies is low.

Alternative splicing (AS) is a ubiquitous mechanism in eukaryotes. It is estimated that 90% of human genes are alternatively spliced. Despite enormous efforts, transcriptome annotations remain incomplete.

Reduction to a satisfiability (SAT) formulation has proven effective in solving certain NP-hard problems. In this work, we extend this line of research by presenting a novel SAT formulation for computing the double-cut-and-join (DCJ) distance between two genomes with duplicate genes. The DCJ distance serves as a crucial metric in studying genome rearrangement.

Emerging single-cell RNA sequencing (scRNA-seq) techniques have enabled the study of cellular transcriptome heterogeneity, yet accurate reconstruction of full-length transcripts at single-cell resolution remains challenging due to high dropout rates and sparse coverage. While meta-assembly approaches offer promising solutions by integrating information across multiple cells, current methods struggle to balance consensus assembly with cell-specific transcriptional signatures. Here, we present Beaver, a cell-specific transcript assembler designed for short-read scRNA-seq data.

Circular RNAs (circRNAs) are a class of RNA molecules that form closed loops with their 5' and 3' ends covalently bonded. CircRNAs are known to be more stable than linear RNAs, have distinct properties and functions, and are promising biomarkers. Existing methods for assembling circRNAs rely heavily on annotated transcriptomes and hence exhibit unsatisfactory accuracy without a high-quality transcriptome.

Motivation: High-throughput RNA sequencing has become indispensable for decoding gene activities, yet the challenge of reconstructing full-length transcripts persists. Traditional single-sample assemblers frequently produce fragmented transcripts, especially in single-cell RNA-seq data. While algorithms designed for assembling multiple samples exist, they encounter various limitations.

Article Synopsis
  • The text discusses the challenges of identifying biologically related sequences in large datasets due to the computational difficulties involved in calculating edit distance between sequences.
  • It introduces a novel approach using locality-sensitive bucketing (LSB) functions, which can efficiently group sequences based on their edit distances, potentially allowing for more manageable comparisons.
  • Additionally, the authors employed machine learning to enhance LSB functions, leading to significant improvements in accuracy compared to existing methods, demonstrating their effectiveness in practical applications like data analysis of erroneous cell barcodes.
Article Synopsis
  • Seeding is key for analyzing large-scale sequences, especially when dealing with high error rates in long reads, and the new method, SubseqHash2, addresses this challenge with improved accuracy.
  • SubseqHash2 enhances speed and efficiency by calculating multiple seed sets in one go, leveraging a dynamic programming framework and SIMD instructions for performance boosts of 10-50 times compared to its predecessor, SubseqHash.
  • The new algorithm outperforms popular substring-based methods (like k-mers) in key applications such as read mapping, sequence alignment, and overlap detection, thus paving the way for broader use of subsequence-based seeds in long-read analysis.
Article Synopsis
  • Circular RNA (circRNA) is a unique type of RNA that forms a stable closed loop and has significant biological functions that were previously underestimated due to biases in RNA sequencing techniques.
  • The new algorithm, TERRACE, addresses the challenge of assembling circRNAs from RNA-seq data by using a splice graph to efficiently identify "bridging" paths and improves detection of back-spliced reads missed by other methods.
  • TERRACE outperforms existing circRNA detection methods in sensitivity and precision, particularly in cases where annotated transcriptomes are unavailable, making it a major advancement in the field.

Transcript annotations play a critical role in gene expression analysis as they serve as a reference for quantifying isoform-level expression. The two main sources of annotations are RefSeq and Ensembl/GENCODE, but discrepancies between their methodologies and information resources can lead to significant differences. It has been demonstrated that the choice of annotation can have a significant impact on gene expression analysis.

In computational biology, k-mers and edit distance are fundamental concepts. However, little is known about the metric space of all k-mers equipped with the edit distance. In this work, we explore the structure of the k-mer space by studying its maximal independent sets (MISs).
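
To make the objects in this abstract concrete, the following is a minimal sketch, not the method of the paper. It assumes that two k-mers count as adjacent when their edit distance is at most a threshold `d` (the threshold, the function names `edit_distance` and `greedy_mis`, and the greedy construction are all illustrative assumptions), and it builds one maximal independent set by scanning k-mers in lexicographic order.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def greedy_mis(k: int, d: int, alphabet: str = "ACGT") -> list[str]:
    """Greedily build a maximal independent set of k-mers.

    Assumption for this sketch: two k-mers conflict when their edit
    distance is at most d. A k-mer is kept only if it is farther than d
    from every k-mer kept so far, so the result is independent, and every
    rejected k-mer is within d of a kept one, so it is also maximal.
    """
    chosen: list[str] = []
    for kmer in map("".join, product(alphabet, repeat=k)):
        if all(edit_distance(kmer, c) > d for c in chosen):
            chosen.append(kmer)
    return chosen


if __name__ == "__main__":
    print(greedy_mis(2, 1))
```

The scan is exhaustive over all 4^k k-mers, so it is only practical for small k. A different scan order can yield a maximal independent set of a different size.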

High-throughput short-read RNA-seq protocols often produce paired-end reads, with the middle portion of the fragments being unsequenced. We explore whether the full-length fragments can be computationally reconstructed from the two sequenced ends in the absence of a reference genome, a problem we refer to here as bridging. Solving this problem provides longer, more informative RNA-seq reads and benefits downstream RNA-seq analyses such as transcript assembly, expression quantification, and differential splicing analysis.

Background: Many bioinformatics applications involve bucketing a set of sequences, where each sequence may be assigned to multiple buckets. To achieve both high sensitivity and high precision, a bucketing method should assign similar sequences to the same bucket and dissimilar sequences to distinct buckets. Existing k-mer-based bucketing methods are efficient on sequencing data with low error rates, but their sensitivity drops sharply on data with high error rates.
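
The sensitivity issue described above can be illustrated with a small sketch. It assumes the simplest k-mer-based scheme, in which a sequence is placed in one bucket per distinct k-mer it contains. The function names `kmer_buckets` and `share_bucket` and the toy sequences are hypothetical, and the schemes evaluated in the paper may differ.

```python
from collections import defaultdict


def kmer_buckets(sequences: list[str], k: int) -> dict[str, set[int]]:
    """Assign each sequence (by index) to one bucket per distinct k-mer it contains."""
    buckets: dict[str, set[int]] = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            buckets[seq[i:i + k]].add(idx)
    return buckets


def share_bucket(buckets: dict[str, set[int]], a: int, b: int) -> bool:
    """Return True if sequences a and b were placed in at least one common bucket."""
    return any(a in members and b in members for members in buckets.values())


if __name__ == "__main__":
    seqs = [
        "ACGTACGT",  # reference sequence
        "ACGTTCGT",  # one substitution relative to the reference
        "ACCTACAT",  # two spread-out substitutions relative to the reference
    ]
    buckets = kmer_buckets(seqs, k=4)
    print(share_bucket(buckets, 0, 1))
    print(share_bucket(buckets, 0, 2))
```

With k = 4, the single-substitution copy still shares the bucket `ACGT` with the reference, while the copy with two spread-out substitutions shares no exact 4-mer with it and is missed.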

Motivation: Modern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.

Motivation: Transcript annotations play a critical role in gene expression analysis as they serve as a reference for quantifying isoform-level expression. The two main sources of annotations are RefSeq and Ensembl/GENCODE, but discrepancies between their methodologies and information resources can lead to significant differences. It has been demonstrated that the choice of annotation can have a significant impact on gene expression analysis.

High-throughput short-read RNA-seq protocols often produce paired-end reads, with the middle portion of the fragments being unsequenced. We explore whether the full-length fragments can be computationally reconstructed from the two sequenced ends in the absence of a reference genome, a problem we refer to here as de novo bridging. Solving this problem provides longer, more informative RNA-seq reads and benefits downstream RNA-seq analyses such as transcript assembly, expression quantification, and differential splicing analysis.

Article Synopsis
  • Modern RNA-sequencing can create multi-end data, which helps in accurately mapping complex RNA splicing but lacks efficient assembly algorithms.
  • Scallop2 is a new assembly tool designed specifically for this type of data, utilizing a three-step process to improve the accuracy of RNA transcript assembly.
  • When tested on various datasets, Scallop2 showed significant improvements in assembly accuracy compared to established assemblers like StringTie2 and the original Scallop.

Motivation: Most modern seed-and-extend NGS read mappers employ a seeding scheme that requires extracting non-overlapping seeds in each read in order to find all valid mappings under a given edit distance threshold. As the threshold grows, this seeding scheme forces mappers to use more and shorter seeds, which increases the number of seed hits (seed frequencies) and therefore reduces the efficiency of mappers.

Results: We propose a novel seeding framework, context-aware seeds (CAS).

Single-molecule long-read sequencing has been used to improve mRNA isoform identification. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis and sequencing length limits. This drives a need for long-read transcript assembly.

Motivated by multiple genome assembly problems and other applications, we study the following minimum path flow decomposition problem: Given a directed acyclic graph $G=(V,E)$ with source $s$ and sink $t$ and a flow $f$, compute a set of $s$-$t$ paths $P$ and assign weight $w(p)$ for $p\in P$ such that $f(e) = \sum_{p\in P: e\in p} w(p)$, $\forall e\in E$, and $|P|$ is minimized. We develop some fundamental theory for this problem, upon which we design an efficient heuristic. Specifically, we prove that the gap between the optimal number of paths and a known upper bound is determined by the nontrivial equations within the flow values.
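
The problem statement can be made concrete with a small sketch. This is not the heuristic developed in the paper. It is a simple greedy baseline that repeatedly peels off the s-t path with the largest bottleneck, and it produces a valid decomposition but not necessarily one with the minimum number of paths. The function name `greedy_width_decomposition` and the input format (a dictionary from edges to positive integer flow values that satisfy flow conservation) are assumptions made for illustration.

```python
from collections import defaultdict, deque


def greedy_width_decomposition(flow, s, t):
    """Decompose a conserved s-t flow on a DAG into weighted s-t paths.

    flow maps each edge (u, v) to a positive integer flow value.
    Returns a list of (path, weight) pairs whose weights sum to f(e)
    on every edge e.
    """
    residual = dict(flow)
    out_edges = defaultdict(list)
    indegree = defaultdict(int)
    nodes = {s, t}
    for u, v in flow:
        out_edges[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))

    # Kahn's algorithm: compute a topological order of the DAG.
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in out_edges[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    paths = []
    while True:
        # width[v] = largest bottleneck over s-v paths that use only
        # edges with positive residual flow.
        width = {s: float("inf")}
        parent = {}
        for u in order:
            if u not in width:
                continue
            for v in out_edges[u]:
                w = min(width[u], residual[(u, v)])
                if w > 0 and w > width.get(v, 0):
                    width[v] = w
                    parent[v] = u
        if t not in width:
            break  # no residual flow reaches t, so the decomposition is complete
        # Recover the widest path and subtract its bottleneck from the residual.
        path = [t]
        while path[-1] != s:
            path.append(parent[path[-1]])
        path.reverse()
        for edge in zip(path, path[1:]):
            residual[edge] -= width[t]
        paths.append((path, width[t]))
    return paths


if __name__ == "__main__":
    example = {("s", "a"): 3, ("s", "b"): 2, ("a", "t"): 1,
               ("a", "b"): 2, ("b", "t"): 4}
    for path, weight in greedy_width_decomposition(example, "s", "t"):
        print(weight, "-".join(path))
```

Each iteration reduces at least one edge to zero residual flow, so the loop ends after at most $|E|$ paths.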

Transcripts are frequently modified by structural variations, which lead to fused transcripts of either multiple genes, known as a fusion gene, or a gene and a previously non-transcribed sequence. Detecting these modifications, called transcriptomic structural variations (TSVs), especially in cancer tumor sequencing, is an important and challenging computational problem. We introduce SQUID, a novel algorithm to predict both fusion-gene and non-fusion-gene TSVs accurately from RNA-seq alignments.

We introduce Scallop, an accurate reference-based transcript assembler that improves reconstruction of multi-exon and lowly expressed transcripts. Scallop preserves long-range phasing paths extracted from reads, while producing a parsimonious set of transcripts and minimizing coverage deviation. On 10 human RNA-seq samples, Scallop produces 34.

A fundamental problem in comparative genomics is to compute the distance between two genomes in terms of their higher-level organization (given by genes or syntenic blocks). For two genomes without duplicate genes, we can easily define (and almost always efficiently compute) a variety of distance measures, but the problem is NP-hard under most models when genomes contain duplicate genes. To tackle duplicate genes, three formulations (exemplar, maximum matching, and any matching) have been proposed, all of which aim to build a matching between homologous genes so as to minimize some distance measure.
