Data visualizations convert numbers into visual marks so that our visual system can extract data from an image instead of from raw numbers. Clearly, the visual system does not compute statistics such as an arithmetic mean or a correlation the way a computer would. Instead, it extracts these patterns using perceptual proxies: heuristic shortcuts over the visual marks, such as a center of mass or a shape envelope.
The MinHash algorithm has proven effective for rapidly estimating the resemblance of two genomes or metagenomes. However, this method cannot reliably estimate the containment of a genome within a metagenome. Here, we describe an online algorithm capable of measuring the containment of genomes and proteomes within either assembled or unassembled sequencing read sets.
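The difference between resemblance and containment can be illustrated with a small streaming example. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`kmer_hashes`, `containment_stream`), the use of BLAKE2b as the hash, and the default parameters are all hypothetical choices made for illustration. It counts what fraction of a genome's bottom-s k-mer hashes is observed while reads are processed one at a time.

```python
import hashlib


def kmer_hashes(seq, k):
    """Return the set of 64-bit hashes of all k-mers in seq (hypothetical helper)."""
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def containment_stream(genome, reads, k=21, s=1000):
    """Estimate the fraction of the genome's k-mers present in a stream of reads.

    The genome is reduced to its s smallest k-mer hashes; each read is then
    checked against that fixed set, so reads never need to be held in memory.
    """
    genome_sketch = set(sorted(kmer_hashes(genome, k))[:s])
    if not genome_sketch:
        return 0.0
    seen = set()
    for read in reads:
        seen |= genome_sketch & kmer_hashes(read, k)
    return len(seen) / len(genome_sketch)
```

Because the denominator is the size of the genome sketch rather than the size of the union, unrelated reads in the sample do not lower the estimate, which is the property a resemblance (Jaccard) estimate lacks when the metagenome is much larger than the genome.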
Perceptual tasks in visualizations often involve comparisons. Of two sets of values depicted in two charts, which set had values that were the highest overall? Which had the widest range? Prior empirical work found that the performance on different visual comparison tasks (e.g.
Microbiome
November 2018
When performing bioforensic casework, it is important to be able to reliably detect the presence of a particular organism in a metagenomic sample, even if the organism is only present in a trace amount. For this task, it is common to use a sequence classification program that determines the taxonomic affiliation of individual sequence reads by comparing them to reference database sequences. As metagenomic data sets often consist of millions or billions of reads that need to be compared to reference databases containing millions of sequences, such sequence classification programs typically use search heuristics and databases with reduced sequence diversity to speed up the analysis, which can lead to incorrect assignments.
Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU h; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition.
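The sketch-then-estimate idea described above can be illustrated in a few lines of Python. This is a simplified sketch, not Mash itself: it assumes BLAKE2b as a stand-in hash, ignores reverse complements, and uses hypothetical function names (`sketch`, `jaccard_estimate`, `mash_distance`). The distance uses the published Mash formula D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard index and k the k-mer length.

```python
import hashlib
import math


def sketch(seq, k=21, s=1000):
    """Bottom-s MinHash sketch: the s smallest k-mer hashes, sorted ascending."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }
    return sorted(hashes)[:s]


def jaccard_estimate(a, b, s):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(a) | set(b))[:s]  # bottom-s sketch of the union
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a mutation distance for k-mer size k."""
    if j <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)
```

For two sequences `x` and `y`, a call such as `mash_distance(jaccard_estimate(sketch(x), sketch(y), 1000), 21)` gives the estimated distance. Only the sketches need to be stored, which is what makes all-pairs comparison of large collections tractable.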
Whole-genome sequences are now available for many microbial species and clades; however, existing whole-genome alignment methods are limited in their ability to compare multiple sequences simultaneously. Here we present the Harvest suite of core-genome alignment and visualization tools for the rapid and simultaneous analysis of thousands of intraspecific microbial strains. Harvest includes Parsnp, a fast core-genome multi-aligner, and Gingr, a dynamic visual platform.
The Bacillus anthracis Carbosap genome, which includes the pXO1 and pXO2 plasmids, has been shown to encode the major B. anthracis virulence factors, yet this strain's attenuation has not yet been explained. Here we report the draft genome sequence of this strain, and a comparison to fully virulent B.
Background: Although genome-wide transcriptional analysis has been used for many years to study bacterial gene expression, many aspects of the bacterial transcriptome remain undefined. One example is antisense transcription, which has been observed in a number of bacteria, though the function of antisense transcripts, and their distribution across the bacterial genome, is still unclear.
Methodology/principal Findings: Single-stranded RNA-seq results revealed a widespread and non-random pattern of antisense transcription covering more than two thirds of the B.
Background: A critical output of metagenomic studies is the estimation of abundances of taxonomical or functional groups. The inherent uncertainty in assignments to these groups makes it important to consider both their hierarchical contexts and their prediction confidence. The current tools for visualizing metagenomic data, however, omit or distort quantitative hierarchical relationships and lack the facility for displaying secondary variables.
Summary: Bisulfite sequencing allows cytosine methylation, an important epigenetic marker, to be detected via nucleotide substitutions. Since the Applied Biosystems SOLiD System uses a unique di-base encoding that increases confidence in the detection of nucleotide substitutions, it is a potentially advantageous platform for this application. However, the di-base encoding also makes reads with many nucleotide substitutions difficult to align to a reference sequence with existing tools, preventing the platform's potential utility for bisulfite sequencing from being realized.
Although gene expression has been studied in bacteria for decades, many aspects of the bacterial transcriptome remain poorly understood. Transcript structure, operon linkages, and information on absolute abundance all provide valuable insights into gene function and regulation, but none has ever been determined on a genome-wide scale for any bacterium. Indeed, these aspects of the prokaryotic transcriptome have been explored on a large scale in only a few instances, and consequently little is known about the absolute composition of the mRNA population within a bacterial cell.
Unlabelled: Here, we report the development of SOCS (short oligonucleotide color space), a program designed for efficient and flexible mapping of Applied Biosystems SOLiD sequence data onto a reference genome. SOCS performs its mapping within the context of 'color space', and it maximizes usable data by allowing a user-specified number of mismatches. Sequence census functions facilitate a variety of functional genomics applications, including transcriptome mapping and profiling, as well as ChIP-Seq.