A PHP Error was encountered

Severity: Warning

Message: file_get_contents(https://...@gmail.com&api_key=61f08fa0b96a73de8c900d749fcb997acc09): Failed to open stream: HTTP request failed! HTTP/1.1 429 Too Many Requests

Filename: helpers/my_audit_helper.php

Line Number: 143

Backtrace:

File: /var/www/html/application/helpers/my_audit_helper.php
Line: 143
Function: file_get_contents

File: /var/www/html/application/helpers/my_audit_helper.php
Line: 209
Function: simplexml_load_file_from_url

File: /var/www/html/application/helpers/my_audit_helper.php
Line: 3098
Function: getPubMedXML

File: /var/www/html/application/controllers/Detail.php
Line: 574
Function: pubMedSearch_Global

File: /var/www/html/application/controllers/Detail.php
Line: 488
Function: pubMedGetRelatedKeyword

File: /var/www/html/index.php
Line: 316
Function: require_once

A PHP Error was encountered

Severity: Warning

Message: Attempt to read property "Count" on bool

Filename: helpers/my_audit_helper.php

Line Number: 3100

Backtrace:

File: /var/www/html/application/helpers/my_audit_helper.php
Line: 3100
Function: _error_handler

File: /var/www/html/application/controllers/Detail.php
Line: 574
Function: pubMedSearch_Global

File: /var/www/html/application/controllers/Detail.php
Line: 488
Function: pubMedGetRelatedKeyword

File: /var/www/html/index.php
Line: 316
Function: require_once

MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores. | LitMetric

MeShClust v3.0: high-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores.

BMC Genomics

Bioinformatics Toolsmith Laboratory, Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville, Kingsville, TX, USA.

Published: June 2022

AI Article Synopsis

  • Tools for clustering biological sequences like CD-HIT and UCLUST are key in computational biology but have room for improvement in cluster quality, leading to the development of MeShClust v3.0 which uses the mean shift algorithm for better accuracy.
  • MeShClust v3.0 was tested against CD-HIT, MeShClust v1.0, and UCLUST across synthetic and real data sets and showed superior performance, achieving a 55%-300% improvement in cluster quality on certain real data sets.
  • Although MeShClust v3.0 requires more time and memory, it remains accessible for most personal computers and accurately estimates key parameters influencing cluster membership.

Article Abstract

Background: Tools for accurately clustering biological sequences are among the most important tools in computational biology. Two pioneering tools for clustering sequences are CD-HIT and UCLUST, both of which are fast and consume reasonable amounts of memory; however, there is a big room for improvement in terms of cluster quality. Motivated by this opportunity for improving cluster quality, we applied the mean shift algorithm in MeShClust v1.0. The mean shift algorithm is an instance of unsupervised learning. Its strong theoretical foundation guarantees the convergence to the true cluster centers. Our implementation of the mean shift algorithm in MeShClust v1.0 was a step forward. In this work, we scale up the algorithm by adapting an out-of-core strategy while utilizing alignment-free identity scores in a new tool: MeShClust v3.0.

Results: We evaluated CD-HIT, MeShClust v1.0, MeShClust v3.0, and UCLUST on 22 synthetic sets and five real sets. These data sets were designed or selected for testing the tools in terms of scalability and different similarity levels among sequences comprising clusters. On the synthetic data sets, MeShClust v3.0 outperformed the related tools on all sets in terms of cluster quality. On two real data sets obtained from human microbiome and maize transposons, MeShClust v3.0 outperformed the related tools by wide margins, achieving 55%-300% improvement in cluster quality. On another set that includes degenerate viral sequences, MeShClust v3.0 came third. On two bacterial sets, MeShClust v3.0 was the only applicable tool because of the long sequences in these sets. MeShClust v3.0 requires more time and memory than the related tools; almost all personal computers at the time of this writing can accommodate such requirements. MeShClust v3.0 can estimate an important parameter that controls cluster membership with high accuracy.

Conclusions: These results demonstrate the high quality of clusters produced by MeShClust v3.0 and its ability to apply the mean shift algorithm to large data sets and long sequences. Because clustering tools are utilized in many studies, providing high-quality clusters will help with deriving accurate biological knowledge.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9171953PMC
http://dx.doi.org/10.1186/s12864-022-08619-0DOI Listing

Publication Analysis

Top Keywords

meshclust v30
36
shift algorithm
20
cluster quality
16
data sets
16
meshclust
13
meshclust v10
12
sets meshclust
12
sets
9
alignment-free identity
8
identity scores
8

Similar Publications

Want AI Summaries of new PubMed Abstracts delivered to your In-box?

Enter search terms and have AI summaries delivered each week - change queries or unsubscribe any time!

A PHP Error was encountered

Severity: Notice

Message: fwrite(): Write of 34 bytes failed with errno=28 No space left on device

Filename: drivers/Session_files_driver.php

Line Number: 272

Backtrace:

A PHP Error was encountered

Severity: Warning

Message: session_write_close(): Failed to write session data using user defined save handler. (session.save_path: /var/lib/php/sessions)

Filename: Unknown

Line Number: 0

Backtrace: