Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2017, Vol. 11 Issue (2) : 307-319    https://doi.org/10.1007/s11704-016-5231-1
RESEARCH ARTICLE
String similarity join with different similarity thresholds based on novel indexing techniques
Chuitian RONG1(),Yasin N. SILVA2,Chunqing LI1
1. School of Computer Science & Software Engineering, Tianjin Polytechnic University, Tianjin 300387, China
2. School of Mathematical & Natural Sciences, Arizona State University, Tempe AZ 85281, USA
 Download: PDF(741 KB)  
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

String similarity join is an essential operation of many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity based on a certain similarity function. The string pairs with similarity above a certain threshold are regarded as results. The current approach to solving the similarity join problem is to use a unique threshold value. There are, however, several scenarios that require the support of multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate much more typos than shorter ones. Therefore, we proposed a solution for string similarity joins that supports different similarity thresholds in a single operator. In order to support different thresholds, we devised two novel indexing techniques: partition based indexing and similarity aware indexing. To utilize the new indices and improve the join performance, we proposed new filtering methods and index probing techniques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while providing a more flexible threshold specification.

Keywords similarity join      similarity aware index      similarity thresholds     
Corresponding Author(s): Chuitian RONG   
Just Accepted Date: 22 February 2016   Online First Date: 17 October 2016    Issue Date: 06 April 2017
 Cite this article:   
Chuitian RONG,Yasin N. SILVA,Chunqing LI. String similarity join with different similarity thresholds based on novel indexing techniques[J]. Front. Comput. Sci., 2017, 11(2): 307-319.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-016-5231-1
https://academic.hep.com.cn/fcs/EN/Y2017/V11/I2/307
1 Monge A, Elkan C. The field matching problem: algorithms and applications. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1996, 267–270
2 Zhang Z J, Hadjieleftheriou M, Ooi B, Srivastava D. Bed-tree: an allpurpose index structure for string similarity search based on edit distance. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2010, 915–926
https://doi.org/10.1145/1807167.1807266
3 Lu W, Du X Y, Hadjieleftheriou M, Ooi B C. Efficiently supporting edit distance based string similarity search using b+-trees. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(12): 2983–2996
https://doi.org/10.1109/TKDE.2014.2309131
4 Wang J N, Feng J H, Li G L. Trie-join: efficient trie-based string similarity joins with edit-distance constraints. Proceedings of the VLDB Endowment, 2010, 3(1–2): 1219–1230
https://doi.org/10.14778/1920841.1920992
5 Sarawagi S, Kirpal A. Efficient set joins on similarity predicates. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2004, 743–754
https://doi.org/10.1145/1007568.1007652
6 Chaudhuri S, Ganti V, Kaushik R. A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd IEEE International Conference on Data Engineering. 2006, 61–72
https://doi.org/10.1109/icde.2006.9
7 Bayardo R, Ma Y, Srikant R. Scaling up all pairs similarity search. In: Proceedings of the 16th ACM International Conference onWorldWide Web. 2007, 131–140
https://doi.org/10.1145/1242572.1242591
8 Xiao C, Wang W, Lin X M, Yu J. Efficient similarity joins for near duplicate detection. In: Proceedings of ACM International Conference on World Wide Web. 2008, 563–574
https://doi.org/10.1145/1367497.1367516
9 Hernández M, Stolfo S. The merge/purge problem for large databases. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 1995, 127–138
https://doi.org/10.1145/223784.223807
10 Winkler W E. The state of record linkage and current research problems. Technical Report, Statistical Research Division, U.S. Census Bureau. 1999
11 Sivic J, Zisserman A. Video google: a text retrieval approach to object matching in videos. 2003, 1470–1477
12 Dong X, Halevy A, Madhavan J. Reference reconciliation in complex information spaces. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2005, 85–96
https://doi.org/10.1145/1066157.1066168
13 Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, 269–278
https://doi.org/10.1145/775047.775087
14 Arasu A, Ré C, Suciu D. Large-scale deduplication with constraints using dedupalog. In: Proceedings of the 25th IEEE International Conference on Data Engineering. 2009, 952–963
https://doi.org/10.1109/icde.2009.43
15 Gravano L, Ipeirotis P G, Jagadish H V, Koudas N, Muthukrishnan S, Srivastava D. Approximate string joins in a database (almost) for free. In: Proceedings of the VLDB Endowment. 2001, 491–500
16 Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1–16
https://doi.org/10.1109/TKDE.2007.250581
17 Naumann F, Herschel M. An Introduction to duplicate detection. Synthesis Lectures on Data Management, 2010, 2(1): 1–87
https://doi.org/10.2200/S00262ED1V01Y201003DTM003
18 Jiang Y, Li G L, Feng J H, Li W S. String similarity joins: an experimental evaluation. Proceedings of the VLDB Endowment, 2014, 7(8): 625–636
https://doi.org/10.14778/2732296.2732299
19 Chaudhuri S, Kaushik R. Extending autocompletion to tolerate errors. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2009, 707–718
https://doi.org/10.1145/1559845.1559919
20 Deng D, Li G L, Feng J H. A pivotal prefix based filtering algorithm for string similarity search. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2014, 673–684
https://doi.org/10.1145/2588555.2593675
21 Wang J N, Li G L, Feng J H. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2012, 85–96
https://doi.org/10.1145/2213836.2213847
22 Rong C T, Lu W, Wang X L, Du X Y, Chen Y G, Tung A K H. Efficient and scalable processing of string similarity join. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(10): 2217–2230
https://doi.org/10.1109/TKDE.2012.195
23 Lu J H, Lin C B, Wang W, Li C, Wang H Y. String similarity measures and joins with synonyms. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2013, 373–384
https://doi.org/10.1145/2463676.2465313
24 Li G L, He J, Deng D, Li J. Efficient similarity join and search on multi-attribute data. In: Proceedings of ACM SIGMOD International Conference on Management of Data. 2015, 1137–1151
https://doi.org/10.1145/2723372.2723733
25 Salton G, McGill M J. Introduction to Modern Information Retrieval. New York: McGraw-Hill, Inc., 1986
26 Witten I H, Moffat A, Bell T C. Managing Gigabytes: Compressing and Indexing Documents and Images. 2nd ed. San Francisco, CA: Morgan Kaufmann, 1999
[1] Minghe YU,Guoliang LI,Dong DENG,Jianhua FENG. String similarity search and join: a survey[J]. Front. Comput. Sci., 2016, 10(3): 399-417.
[2] Yue WANG,Hongzhi WANG,Jianzhong LI,Hong GAO. Efficient graph similarity join for information integration on graphs[J]. Front. Comput. Sci., 2016, 10(2): 317-329.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed