频率连接：基于数据划分的一种高效字符串相似性连接算法

doi:10.1631/FITEE.1601347

Frontiers of Information Technology & Electronic Engineering

2017, Vol. 18

Issue (10): 1499-1510 https://doi.org/10.1631/FITEE.1601347

本期目录

频率连接：基于数据划分的一种高效字符串相似性连接算法

骆吉洲^1,²(

), 石胜飞¹(

), 王宏志¹(

), 李建中¹(

)

¹. 哈尔滨工业大学计算机科学与技术学院，中国哈尔滨市，150001
². 广东省普及型高性能重点实验室，深圳市服务计算与应用重点实验室，中国深圳市，518000

FrepJoin: an efficient partition-based algorithm for edit similarity join

Ji-zhou LUO^1,²(

), Sheng-fei SHI¹(

), Hong-zhi WANG¹(

), Jian-zhong LI¹(

)

¹. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
². Guangdong Key Laboratory of Popular High Performance Computers, Key Laboratory of Service Computing and Application, Shenzhen 518000, China

全文: PDF(401 KB) HTML

摘要:

字符串相似性连接（string similarity join, SSJ）在很多应用中，特别是在需要找出重复对象的应用中发挥着关键作用。本文关注基于编辑距离的字符串相似性连接。现有算法大多采用先过滤再细化的框架，使得它们很难发现和利用字符串子集间的不相似性，也很难利用如字符频率这样的统计信息。本研究提出了一种基于数据划分的字符串相似性连接算法，它充分利用了这种统计信息。采用频率向量将字符串集划分成一系列较小的子集，使得子集之间的不相似性很容易被发现。本文提出的新算法利用划分后的数据高效地对字符串进行相似性。此外，本文还给出了一个新的过滤器，它能利用字符频率来过滤很多能够通过现有过滤器的不相似字符串。真实数据集上的试验表明，本文提出的算法性能较现有算法有较大幅度的提升。

Abstract：

String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.

Key words： String similarity join Edit distance Filter and refine Data partition Combined frequency vectors

收稿日期: 2016-06-17 出版日期: 2018-01-18

通讯作者: 骆吉洲 E-mail: luojizhou@hit.edu.cn;shengfei@hit.edu.cn;wangzh@hit.edu.cn;lijzh@hit.edu.cn

Corresponding Author(s): Ji-zhou LUO

引用本文:

骆吉洲, 石胜飞, 王宏志, 李建中. 频率连接：基于数据划分的一种高效字符串相似性连接算法[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(10): 1499-1510.
Ji-zhou LUO, Sheng-fei SHI, Hong-zhi WANG, Jian-zhong LI. FrepJoin: an efficient partition-based algorithm for edit similarity join. Front. Inform. Technol. Electron. Eng, 2017, 18(10): 1499-1510.

链接本文:

https://academic.hep.com.cn/fitee/CN/10.1631/FITEE.1601347
https://academic.hep.com.cn/fitee/CN/Y2017/V18/I10/1499

1	Afrati , F.N., Sarma , A.D., Menestrina , D., et al., 2012. Fuzzy joins using MapReduce. Int. Conf. on Data Engineering, p.498–509.
2	Arasu , A., Ganti , V., Kaushik , R., 2006. Efficient exact set-similarity joins. Int. Conf. on Very Large Data Bases, p.918–929.
3	Bayardo , R.J., Ma , Y., Srikant , R., 2007. Scaling up all pairs similarity search. Int. World Wide Web Conf., p.131–140.
4	Chaudhuri , S., Ganjam , K., Ganti , V., et al., 2003. Robust and efficient fuzzy match for online data cleaning. Int. SIGMOD Conf. on Management of Data, p.313–324.
5	Chaudhuri , S., Ganti , V., Kaushik , R., 2006a. Data debugger: an operator-centric approach for data quality solutions. IEEE Data Eng. Bull. , 29(2):60–66.
6	Chaudhuri , S., Ganti , V., Kaushik , R., 2006b. A primitive operator for similarity joins in data cleaning. Int. Conf. on Data Engineering, p.687–698.
7	Dong , X., Halevy , A.Y., Yu , C., 2007. Data integration with uncertainty. Int. Conf. on Very Large Data Bases, p.687–698.
8	Feng , J., Wang , J., Li , G., 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J. , 21(4):437–461.
9	Ge , T., Li , Z., 2011. Approximate substring matching over uncertain strings. Proc. VLDB Endow. , 4(11):772–782.
10	Gravano , L., Ipeirotis , P.G., Jagadish , H.V., et al., 2001. Approximate string joins in a database (almost) for free. Int. Conf. on Very Large Data Bases, p.491–500.
11	Hadjieleftheriou , M., Srivastava , D., 2010. Weighted Set-Based String Similarity. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. AT&T Lab-Research.
12	Henzinger , M.R., 2006. Finding near-duplicate web pages: a large-scale evaluation of algorithms. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.284–291.
13	Ji , S., Li , G., Li , C., et al., 2009. Efficient interactive fuzzy keyword search. Int. World Wide Web Conf., p.371–380.
14	Li , C., Lu , J., Lu , Y., 2008. Efficient merging and filtering algorithms for approximate string searches. Int. Conf. on Data Engineering, p.257–266.
15	Li , G., Deng , D., Wang , J., et al., 2011. Pass-Join: a partition-based method for similarity joins. Proc. VLDB Endow. , 5(3):253–264.
16	Metwally , A., Agrawal , D., Abbadi , A.E., 2007. Detectives: detecting coalition hit inflation attacks in advertising networks streams. Int. World Wide Web Conf., p.241–250.
17	Navarro , G., Salmela , L., 2009. Indexing variable length substrings for exact and approximate matching. Int. Symp. on String Processing and Information Retrieval, p.214–221.
18	Qin , J., Wang , W., Xiao , C., et al., 2013. Vchunkjoin: an efficient algorithm for edit similarity joins. Trans. Knowl. Dat. Eng. , 25(8):1916–1929.
19	Sarawagi , S., Kirpal , A., 2004. Efficient set joins on similarity predicates. Int. SIGMOD Conf. on Management of Data, p.743–754.
20	Scott , D.W., 1979. On optimal and data-based histograms. Biometrika, 66:605–610.
21	Vernica , R., Carey , M.J., Li , C., 2010. Efficient parallel set-similarity joins using MapReduce. Int. SIGMOD Conf. on Management of Data, p.495–506.
22	Wang , J., Li , G., Feng , J., 2011. Fast-join: an efficient method for fuzzy token matching based string similarity join. Int. Conf. on Data Engineering, p.458–469.
23	Wang , W., Xiao , C., Lin , X., et al., 2009. Efficient approximate entity extraction with edit distance constraints. Int. SIGMOD Conf. on Management of Data, p.759–770.
24	Xiao , C., Wang , W., Lin , X., 2008a. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. , 1(1):933–944.
25	Xiao , C., Wang , W., Lin , X., et al., 2008b. Efficient similarity joins for near duplicate detection. Int. World Wide Web Conf., p.131–140.

[1]		Download
[2]		Download

Viewed

Full text

Abstract

Cited

Shared

Discussed