Background: Oligonucleotide signatures (signatures) have been widely used for studying microbial diversity and function in wet-lab settings, but using them for accurate in silico identification of organisms from high-throughput sequencing (HTS) data is only a proof of concept. Existing signature design programs for sequence signatures (signatures matching exactly one sequence) or clade signatures (signatures matching every sequence in a phylogenetic clade) are not able to identify all possible polymorphic sites for sequences with high similarity and perform poorly when handling large genome sequencing datasets.
Results: We introduce cluster signatures: subsequences that match perfectly and exclusively any group of sequences in a data set. Cluster signatures provide complete recall for primer/probe design and increased discrimination between sequences beyond that of clade signatures. Using cluster signatures for in silico identification of HTS targets achieves good precision/recall and running time performance. This method has been implemented in an open source tool, the Automated Oligonucleotide Design Pipeline (aodp), which is included in the supplementary material and available at https://bitbucket.org/wenchen_aafc/aodp_v2.0_release
Conclusions: Cluster signatures provide a rapid and universal analysis tool to identify all possible short diagnostic DNA markers and variants from any DNA sequencing dataset. They are particularly useful in discriminating genetic material from closely related organisms and in detecting deleterious mutations in highly or perfectly conserved genomic sites.
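The definition of a cluster signature given in the Results (a subsequence that matches perfectly and exclusively some group of sequences) can be illustrated with a minimal sketch. This is not the aodp implementation: the function name `cluster_signatures`, the fixed signature length `k`, and the toy sequences are all assumptions made for illustration, and the sketch ignores reverse complements, ambiguous bases, and the scalability concerns the tool addresses.

```python
from collections import defaultdict


def cluster_signatures(sequences, k):
    """Group k-mers by the exact set of sequences that contain them.

    `sequences` maps a sequence name to its DNA string. The result maps
    each group of names (a frozenset) to the k-mers that occur in every
    member of that group and in no other sequence of the data set.
    Hypothetical helper for illustration only.
    """
    # Step 1: record, for every k-mer, which sequences contain it.
    owners = defaultdict(set)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)

    # Step 2: invert the index so each distinct owner set becomes a cluster.
    clusters = defaultdict(set)
    for kmer, names in owners.items():
        clusters[frozenset(names)].add(kmer)
    return dict(clusters)


# Toy data (invented for this example).
toy = {"a": "ACGTAC", "b": "ACGTTC", "c": "TTGTTC"}
result = cluster_signatures(toy, k=4)

# "ACGT" occurs in a and b only, so it is a signature of the cluster {a, b}.
print(result[frozenset({"a", "b"})])
# "GTTC" occurs in b and c only, so it is a signature of the cluster {b, c}.
print(result[frozenset({"b", "c"})])
```

In this sketch a sequence signature is simply the special case of a cluster with one member, and a clade signature is the special case in which the group happens to coincide with a phylogenetic clade.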
Download full-text PDF | Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6284311 | PMC |
http://dx.doi.org/10.1186/s12859-018-2363-3 | DOI Listing |