|
|
A genetic algorithm based entity resolution approach with active learning |
Chenchen SUN( ),Derong SHEN,Yue KOU,Tiezheng NIE,Ge YU |
Sohool of Information Science and Engineering, Northeastern University, Shenyang 110819, China |
|
|
Abstract Entity resolution is a key aspect in data quality and data integration, identifying which records correspond to the same real world entity in data sources. Many existing approaches require manually designed match rules to solve the problem, which always needs domain knowledge and is time consuming. We propose a novel genetic algorithm based entity resolution approach via active learning. It is able to learn effective match rules by logically combining several different attributes’ comparisons with proper thresholds. We use active learning to reduce manually labeled data and speed up the learning process. The extensive evaluation shows that the proposed approach outperforms the sate-of-the-art entity resolution approaches in accuracy.
|
Keywords
entity resolution
genetic algorithm
active learning
data quality
data integration
|
Corresponding Author(s):
Chenchen SUN
|
Just Accepted Date: 02 November 2015
Online First Date: 19 April 2016
Issue Date: 11 January 2017
|
|
1 |
Chen J C, Chen Y G, Du X Y, Li C P, Lu J H, Zhao S Y, Zhou X. Big data challenge: a data management perspective. Frontiers of Computer Science, 2013, 7(2): 157–164
https://doi.org/10.1007/s11704-013-3903-7
|
2 |
Fellegi I P, Sunter A B. A theory for record linkage. Journal of the American Statistical Association, 1969, 64(328): 1183–1210
https://doi.org/10.1080/01621459.1969.10501049
|
3 |
Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1–16
https://doi.org/10.1109/TKDE.2007.250581
|
4 |
Kopcke H, Rahm E. Frameworks for entity matching: a comparison. Data and Knowledge Engineering, 2010, 69(2): 197–210
https://doi.org/10.1016/j.datak.2009.10.003
|
5 |
Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, 269–278
https://doi.org/10.1145/775047.775087
|
6 |
Monge A E, Elkan C. The field matching problem: algorithms and applications. In: Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and DataMining. 1996, 267–270
|
7 |
Pinheiro J C, Sun D X. Methods for linking and mining massive heterogeneous databases. In: Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1998, 309–313
|
8 |
Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003, 39–48
https://doi.org/10.1145/956750.956759
|
9 |
Minton S N, Nanjo C, Knoblock C A, Michalowski M, Michelson M. A heterogeneous field matching method for record linkage. In: 5th IEEE International Conference on Data Mining. 2005, 8
https://doi.org/10.1109/ICDM.2005.7
|
10 |
Sun C, Shen D, Kou Y, Nie T, Yu G. ERGP: A Combined Entity Resolution Approach with Genetic Programming. In: Proceedings of the 11th IEEE Web Information System and Application Conference. 2014, 215–220
https://doi.org/10.1109/wisa.2014.46
|
11 |
Li P, Dong X L, Maurino A, Srivastava D. Linking temporal records. Frontiers of Computer Science, 2012, 6(3): 293–312
https://doi.org/10.1007/s11704-012-2002-5
|
12 |
Sun C C, Shen D R, Kou Y, Nie T Z, Yu G. GB-JER: A Graph-Based Model for Joint Entity Resolution. In: Proceedings of the 20th International Conference on Database Systems for Advanced Applications. 2015, 458–473
|
13 |
Tejada S, Knoblock C A, Minton S. Learning object identification rules for information integration. Information Systems, 2001, 26(8): 607–633
https://doi.org/10.1016/S0306-4379(01)00042-4
|
14 |
Winkler W E. Methods for record linkage and bayesian networks. Technical report, Series RRS2002/05. 2002
|
15 |
De Carvalho M G, Gonçalves M A, Laender A H F, Da Silva A S. Learning to deduplicate. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. 2006, 41–50
https://doi.org/10.1145/1141753.1141760
|
16 |
De Carvalho M G, Laender A H, Gonçalves M A, Da Silva A S. Replica identification using genetic programming. In: Proceedings of the 2008 ACM Symposium on Applied Computing. 2008, 1801–1806
https://doi.org/10.1145/1363686.1364118
|
17 |
De Carvalho M G, Laender A H, Gonçalves M A, Da Silva A S. A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(3): 399–412
https://doi.org/10.1109/TKDE.2010.234
|
18 |
Isele R, Bizer C. Learning linkage rules using genetic programming. In: Proceedings of the 6th International Workshop on Ontology Matching. 2011, 13–24
|
19 |
Isele R, Bizer C. Learning expressive linkage rules using genetic programming. The Very Large Databases Endowment, 2012, 5(11): 1638–1649
https://doi.org/10.14778/2350229.2350276
|
20 |
Banzhaf W, Nordin P, Keller R E, Francone F D. Genetic programming: an introduction. San Francisco: Morgan Kaufmann, 1998
https://doi.org/10.1007/BFb0055923
|
21 |
Poli R, Langdon W B, McPhee N F, Koza J R. A field guide to genetic programming. Lulu. com, 2008
|
22 |
Liere R, Tadepalli P. Active learning with committees for text categorization. In: Proceedings of the 14th National Conference on Artificial Intelligence. 1997, 591–596
|
23 |
Cohn D A, Atlas L, R Ladner. Improving generalization with active learning. Machine Learning, 1994, 15(2): 201–221
https://doi.org/10.1007/BF00993277
|
24 |
Bellare K, Suresh I, Parameswaran A, Rastogi V. Active sampling for entity matching with guarantees. ACM Transactions on Knowledge Discovery from Data, 2013, 7(3): 12
https://doi.org/10.1145/2513092.2500490
|
25 |
Arasu A, Gotz M, Kaushik R. On active learning of record matching packages. In: Proceedings of the 2010 International Conference on Management of Data. 2010, 783–794
https://doi.org/10.1145/1807167.1807252
|
26 |
Koza J R. Genetic Programming: on the Programming of Computers by Means of Natural Selection. Boston: MIT press, 1992
|
27 |
Cohen W, Ravikumar P, Fienberg S. A comparison of string metrics for matching names and records. In: Proceedings of the 9th ACM SIGKDDWorkshop on Data Cleaning and Object Consolidation. 2003, 73–78
|
28 |
Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York: ACM press, 1999
|
29 |
Blickle T, Thiele L. A Comparison of Selection Schemes Used in Genetic Algorithms. TIK-ReportNo.11, 1995
|
30 |
Monge A E, Elkan C P. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of the 2nd ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery. 1997, 23–29
|
31 |
Shannon C E. A mathematical theory of communication. Bell System Technical Journal, 1948, 27(3): 379–423
https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
|
32 |
Hassanzadeh O, Chiang F, Lee H C, Miller R J. Framework for evaluating clustering algorithms in duplicate detection. The Very Large Databases Endowment, 2009, 2(1): 1282–1293
https://doi.org/10.14778/1687627.1687771
|
|
Viewed |
|
|
|
Full text
|
|
|
|
|
Abstract
|
|
|
|
|
Cited |
|
|
|
|
|
Shared |
|
|
|
|
|
Discussed |
|
|
|
|