Efficient string similarity join in multi-core and distributed systems.

PLoS One

School of Computer Science and Technology, Donghua University, Shanghai, China.

Published: September 2017

A significant challenge for string similarity join in the big data area is finding all similar pairs efficiently. In this paper, we propose a parallel processing framework for efficient string similarity join. First, the input is split into disjoint small subsets according to the joint frequency distribution and the interval distribution of the strings. Then a filter-verification strategy is applied when computing string similarity within each subset: the number of candidate pairs is reduced first, and an effective pruning strategy is then used to improve performance. Finally, the string join is executed in parallel. We propose two algorithms that implement the framework: Para-Join, based on multi-threading, for multi-core systems, and Pada-Join, based on the Spark platform, for cluster systems. We prove that Para-Join and Pada-Join not only avoid redundant computation but also ensure the completeness of the result. Experimental results show that Para-Join achieves high efficiency and significantly outperforms state-of-the-art approaches, while Pada-Join can handle large datasets.
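The partition, filter, verify, and parallel-join steps described in the abstract can be illustrated with a minimal Python sketch. This is not the Para-Join or Pada-Join algorithm: the abstract does not give their details, so the sketch makes several assumptions. It partitions strings by length (a stand-in for the paper's frequency and interval distributions), uses edit distance with a threshold `tau` as the similarity measure, and uses a length filter plus early termination as the pruning step. The names `edit_distance_within`, `join_partition`, and `similarity_join` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def edit_distance_within(a, b, tau):
    """Return True if the Levenshtein distance between a and b is <= tau."""
    # Filter step (assumed): strings whose lengths differ by more than tau
    # cannot be within edit distance tau.
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        # Pruning step (assumed): row minima never decrease, so once a whole
        # row exceeds tau the final distance must exceed tau as well.
        if min(curr) > tau:
            return False
        prev = curr
    return prev[-1] <= tau


def join_partition(length, groups, tau):
    """Join one length group with itself and with the next tau length groups.

    Each pair is produced only by the task that owns the shorter string,
    so no pair is computed twice and no qualifying pair is missed.
    """
    result = []
    own = groups[length]
    for i in range(len(own)):
        for j in range(i + 1, len(own)):
            if edit_distance_within(own[i], own[j], tau):
                result.append((own[i], own[j]))
    for other_len in range(length + 1, length + tau + 1):
        for s in own:
            for t in groups.get(other_len, []):
                if edit_distance_within(s, t, tau):
                    result.append((s, t))
    return result


def similarity_join(strings, tau, workers=4):
    """Return all pairs of distinct strings within edit distance tau."""
    groups = defaultdict(list)
    for s in sorted(set(strings)):
        groups[len(s)].append(s)
    groups = dict(groups)
    # One task per length group, run on a thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda n: join_partition(n, groups, tau), sorted(groups))
        return sorted(pair for part in parts for pair in part)


if __name__ == "__main__":
    words = ["kitten", "sitten", "sitting", "mitten", "apple", "apply"]
    for pair in similarity_join(words, tau=1):
        print(pair)
```

Assigning each pair to the group of its shorter string is one simple way to get both properties the abstract claims, namely no repeated computation and a complete result. In CPython, threads do not speed up CPU-bound work of this kind, so the thread pool here only shows the task structure; a process pool or a Spark job would be needed for real parallel speedup.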

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5344375 (PMC)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0172526 (PLOS)
