SamSelect: a sample sequence selection algorithm for quorum planted motif search on large DNA datasets.

BMC Bioinformatics

School of Computer Science and Technology, Xidian University, Xi'an, 710071, China.

Published: June 2018

The quorum planted motif search (qPMS) finds l-length DNA motifs in a set of DNA sequences, typically to identify transcription factor binding sites, but existing qPMS algorithms are too slow for large datasets such as ChIP-seq.
The authors' analysis shows that a larger number of sequences (t) or a smaller quorum fraction (q) leads to longer computation times for qPMS algorithms.
The proposed method, SamSelect, selects a smaller, high-quality sample sequence set from the input so that qPMS algorithms can identify motifs faster on this reduced set than on the original dataset.

Background: Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches. It is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms can efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time-consuming for large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more.
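
The qPMS condition defined above can be illustrated with a minimal sketch. It only checks whether one candidate l-length string satisfies the quorum condition; it is not a qPMS algorithm from the paper, and the function names (`hamming`, `occurs_in`, `is_quorum_motif`) are hypothetical names chosen for this example.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_in(candidate: str, seq: str, d: int) -> bool:
    """True if some l-length window of seq differs from candidate in at most d positions."""
    l = len(candidate)
    return any(
        hamming(candidate, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def is_quorum_motif(candidate: str, seqs: list[str], d: int, q: float) -> bool:
    """True if candidate occurs (with up to d mismatches) in at least q*t of the t sequences."""
    hits = sum(occurs_in(candidate, s, d) for s in seqs)
    return hits >= q * len(seqs)


# Toy input (made up for illustration): t = 4 sequences, l = 4.
seqs = ["ACGTACGT", "TTACGATT", "GGGGGGGG", "CCACGTCC"]
result_d1 = is_quorum_motif("ACGT", seqs, d=1, q=0.75)
result_d0 = is_quorum_motif("ACGT", seqs, d=0, q=0.75)
```

With d = 1 the candidate occurs in three of the four toy sequences, which meets the quorum qt = 3; with d = 0 it occurs in only two, which does not.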

Results: We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D.
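
The two-stage workflow described above (select D' from D, then run a qPMS algorithm on D') can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how SamSelect chooses sequences, so `select_random_sample` is a uniform random placeholder that, unlike SamSelect, does nothing to raise the fraction of motif-containing sequences. `brute_force_qpms` is an exhaustive stand-in for a real qPMS algorithm and is only practical for very small l. All function names are hypothetical.

```python
import random
from itertools import product


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_in(candidate: str, seq: str, d: int) -> bool:
    """True if some window of seq is within d mismatches of candidate."""
    l = len(candidate)
    return any(hamming(candidate, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))


def brute_force_qpms(seqs: list[str], l: int, d: int, q: float) -> list[str]:
    """Stand-in qPMS: test all 4^l strings against the quorum condition."""
    need = q * len(seqs)
    motifs = []
    for letters in product("ACGT", repeat=l):
        cand = "".join(letters)
        if sum(occurs_in(cand, s, d) for s in seqs) >= need:
            motifs.append(cand)
    return motifs


def select_random_sample(seqs: list[str], t_prime: int, seed: int = 0) -> list[str]:
    """Placeholder selector (NOT SamSelect): pick t_prime sequences uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(seqs, min(t_prime, len(seqs)))


def search_on_sample(seqs, l, d, q_prime, t_prime, selector=select_random_sample):
    """Select a sample D' of size t_prime, then search it with quorum q_prime."""
    sample = selector(seqs, t_prime)
    return sample, brute_force_qpms(sample, l, d, q_prime)


# Toy dataset D (made up): every sequence contains the 5-mer ACGTA exactly.
D = ["GGACGTAGG", "TTTACGTAT", "ACGTACCCC",
     "CCCCACGTA", "GACGTAGGG", "TACGTATTT"]
sample, motifs = search_on_sample(D, l=5, d=0, q_prime=1.0, t_prime=3)
```

Passing the selector as a parameter keeps the search step unchanged if a better selection strategy is substituted for the random placeholder.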

Conclusions: We improve the ability of existing qPMS algorithms to process large DNA datasets by selecting high-quality sample sequence sets, so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D' rather than taking an infeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6006848
DOI: http://dx.doi.org/10.1186/s12859-018-2242-y

Authors: Qiang Yu, Dingbang Wei, Hongwei Huo
