Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2017, Vol. 11 Issue (5) : 895-911    https://doi.org/10.1007/s11704-016-5346-4
RESEARCH ARTICLE
MapReduce-based entity matching with multiple blocking functions
Cheqing JIN(), Jie CHEN, Huiping LIU
Institute for Data Science and Engineering, School of Computer Science and Software Engineering, East China Normal University, Shanghai 200062, China
 Download: PDF(1097 KB)  
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

Entity matching that aims at finding some records belonging to the same real-world objects has been studied for decades. In order to avoid verifying every pair of records in a massive data set, a common method, known as the blockingbased method, tends to select a small proportion of record pairs for verification with a far lower cost thanO(n2), where n is the size of the data set. Furthermore, executing multiple blocking functions independently is critical since much more matching records can be found in this way, so that the quality of the query result can be improved significantly.

It is popular to use the MapReduce (MR) framework to improve the performance and the scalability of some complicated queries by running a lot of map (/reduce) tasks in parallel. However, entity matching upon the MapReduce framework is non-trivial due to two inevitable challenges: load balancing and pair deduplication. In this paper, we propose a novel solution, called MrEm, to handle these challenges with the support of multiple blocking functions. Although the existing work can deal with load balancing and pair deduplication respectively, it still cannot deal with both challenges at the same time. Theoretical analysis and experimental results upon real and synthetic data sets illustrate the high effectiveness and efficiency of our proposed solutions.

Keywords entity matching      MapReduce      load balancing      pair deduplication     
Corresponding Author(s): Cheqing JIN   
Just Accepted Date: 31 December 2015   Online First Date: 14 September 2016    Issue Date: 26 September 2017
 Cite this article:   
Cheqing JIN,Jie CHEN,Huiping LIU. MapReduce-based entity matching with multiple blocking functions[J]. Front. Comput. Sci., 2017, 11(5): 895-911.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-016-5346-4
https://academic.hep.com.cn/fcs/EN/Y2017/V11/I5/895
1 BenjellounO, Garcia-Molina H, MenestrinaD , SuQ, WhangS E, WidomJ. Swoosh: a generic approach to entity resolution. The VLDB Journal—The International Journal on Very Large Data Bases, 2009, 18(1): 255–276
2 BilenkoM, MooneyR J. Adadptive duplicate detection using learnable string similarity measures. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003, 39–48
3 GuoS T, DongX L, SrivastavaD , ZajacR. Record linkage with uniqueness constraints and erroneous values. Proceedings of the VLDB Endowment, 2010, 3(1–2): 417–428
https://doi.org/10.14778/1920841.1920897
4 LiP, DongX L, MaurinoA, Srivastava D. Linkingtemporal records. Proceedings of the VLDB Endowment, 2011, 4(11): 956–967
5 RastogiV, DalviN, GarofalakisM . Large-scale collective entity matching. Proceedings of the VLDB Endowment, 2011, 4(4): 208–218
https://doi.org/10.14778/1938545.1938546
6 BilenkoM, KamathB, MooneyR J. Adaptive blocking: learning to scale up record linkage. In: Proceedings of the 6th IEEE International Conference on Data Mining. 2006, 87–96
https://doi.org/10.1109/icdm.2006.13
7 ChristenP. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(9): 1537–1555
https://doi.org/10.1109/TKDE.2011.127
8 De VriesT, KeH, ChawlaS, Christen P. Robust record linkage blocking using suffix arrays and bloom filters. ACM Transactions on Knowledge Discovery from Data, 2011, 5(2): 9
https://doi.org/10.1145/1921632.1921635
9 MichelsonM, Knoblock C A. Learning blocking schemes for record linkage. In: Proceedings of the National Conference on Artificial Intelligence. 2006, 440–445
10 FellegiI P, SunterA B. A theory for record linkage. Journal of the American Statistical Association, 1969, 64(328): 1183–1210
https://doi.org/10.1080/01621459.1969.10501049
11 HernándezM A, Stolfo S J. The merge/purge problem for large databases. ACM SIGMOD Record, 1995, 24(2): 127–138
https://doi.org/10.1145/568271.223807
12 GionisA, IndykP, MotwaniR. Similarity search in high dimensions via hashing. The VLDB Journal — The International Journal on Very Large Data Bases, 1999, 99(6): 518–529
13 IndykP, Motwani R. Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 30th Annual ACM Symposium on Theory of Computing. 1998, 604–613
https://doi.org/10.1145/276698.276876
14 KolbL, ThorA, RahmE. Multi-pass sorted neighborhood blocking with MapReduce. Computer Science-Research and Development, 2012, 27(1): 45–63
https://doi.org/10.1007/s00450-011-0177-x
15 WhangS E, Menestrina D, KoutrikaG , TheobaldM, Garcia-Molina H. Entity resolution with iterative blocking. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. 2009, 219–232
https://doi.org/10.1145/1559845.1559870
16 KolbL, ThorA, RahmE. Load balancing for MapReduce-based entity resolution. In: Proceedings of the 28th IEEE International Conference on Data Engineering. 2012, 618–629
https://doi.org/10.1109/icde.2012.22
17 KöpckeH, ThorA, RahmE. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 2010, 3(1–2): 484–493
https://doi.org/10.14778/1920841.1920904
18 KolbL, ThorA, RahmE. Don’t match twice:redundancy-free similarity computation with MapReduce. In: Proceedings of the 2nd Workshop on Data Analytics in the Cloud. 2013, 1–5
https://doi.org/10.1145/2486767.2486768
19 KolbL, RahmE. Parallel entity resolution with dedoop. Datenbank- Spektrum, 2013, 13(1): 23–32
20 DeanJ, Ghemawat S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107–113
https://doi.org/10.1145/1327452.1327492
21 WhiteT. Hadoop: The Definitive Guide. 3rd ed. O’Reilly Media, Inc., 2012
22 MitzenmacherM. Compressed bloom filters. IEEE/ACM Transactions on Networking, 2002, 10(5): 604–612
https://doi.org/10.1109/TNET.2002.803864
23 VernicaR, CareyM J, LiC. Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the 2010 ACMSIGMOD International Conference on Management of Data. 2010, 495–506
https://doi.org/10.1145/1807167.1807222
24 BaxterR, Christen P, ChurchesT . A comparison of fast blocking methods for record linkage. ACM SIGKDD, 2003, 3: 25–27
25 CohenW W, Richman J. Learning to match and cluster large highdimensional data sets for data integration. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, 475–480
26 JinL, LiC, MehrotraS. Efficient record linkage in large data sets. In: Proceedings of the 8th International Conference on Database Systems for Advanced Applications. 2003, 137–146
27 HeY B, TanH Y, LuoW M, Feng S Z, FanJ P . MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data. Frontiers of Computer Science, 2014, 8(1): 83–99
https://doi.org/10.1007/s11704-013-3158-3
28 Das SarmaA, HeY Y, ChaudhuriS. Clusterjoin: a similarity joins framework using map-reduce. Proceedings of the VLDB Endowment, 2014, 7(12): 1059–1070
https://doi.org/10.14778/2732977.2732981
29 DengD, LiG L, HaoS, Wang J N, FengJ H . Massjoin: a MapReducebased method for scalable string similarity joins. In: proceedings of the 30th IEEE International Conference on Data Engineering. 2014, 340–351
https://doi.org/10.1109/icde.2014.6816663
30 KimY, ShimK. Parallel top-k similarity join algorithms using MapReduce. In: Proceedings of the 28th IEEE International Conference on Data Engineering. 2012, 510–521
https://doi.org/10.1109/icde.2012.87
[1] FCS-0895-15346-CQJ_suppl_1 Download
[1] Zhuo WANG, Qun CHEN, Bo SUO, Wei PAN, Zhanhuai LI. Reducing partition skew on MapReduce: an incremental allocation approach[J]. Front. Comput. Sci., 2019, 13(5): 960-975.
[2] Yihong GAO, Huadong MA. StreamTune: dynamic resource scheduling approach for workload skew in video data center[J]. Front. Comput. Sci., 2018, 12(4): 669-681.
[3] Xiong FU, Juzhou CHEN, Song DENG, Junchang WANG, Lin ZHANG. Layered virtual machine migration algorithm for network resource balancing in cloud computing[J]. Front. Comput. Sci., 2018, 12(1): 75-85.
[4] Quanqing XU,Rajesh Vellore ARUMUGAM,Khai Leong YONG,Yonggang WEN,Yew-Soon ONG,Weiya XI. Adaptive and scalable load balancing for metadata server cluster in cloud-scale file systems[J]. Front. Comput. Sci., 2015, 9(6): 904-918.
[5] Xite WANG,Derong SHEN,Mei BAI,Tiezheng NIE,Yue KOU,Ge YU. SAMES: deadline-constraint scheduling in MapReduce[J]. Front. Comput. Sci., 2015, 9(1): 128-141.
[6] Huiju WANG,Furong LI,Xuan ZHOU,Yu CAO,Xiongpai QIN,Jidong CHEN,Shan WANG. HC-Store: putting MapReduce’s foot in two camps[J]. Front. Comput. Sci., 2014, 8(6): 859-871.
[7] Yaobin HE, Haoyu TAN, Wuman LUO, Shengzhong FENG, Jianping FAN. MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data[J]. Front. Comput. Sci., 2014, 8(1): 83-99.
[8] Xuejun YANG, Xiangke LIAO, Weixia XU, Junqiang SONG, Qingfeng HU, Jinshu SU, Liquan XIAO, Kai LU, Qiang DOU, Juping JIANG, Canqun YANG, . TH-1: China’s first petaflop supercomputer[J]. Front. Comput. Sci., 2010, 4(4): 445-455.
[9] Wenchao JIANG, Matthias BAUMGARTEN, Yanhong ZHOU, Hai JIN, . A bipartite model for load balancing in grid computing environments[J]. Front. Comput. Sci., 2009, 3(4): 503-523.
[10] Zhiwei XU , Li ZHA , Yongqiang HE , Wei LIN , . Four styles of parallel and net programming[J]. Front. Comput. Sci., 2009, 3(3): 290-301.
[11] YANG Xuejun, WANG Panfeng, DU Yunfei, ZHOU Haifang. A data-distributed parallel algorithm for wavelet-based fusion of remote sensing images[J]. Front. Comput. Sci., 2007, 1(2): 231-240.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed