Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2017, Vol. 11 Issue (1) : 147-159    https://doi.org/10.1007/s11704-015-5276-6
RESEARCH ARTICLE
A genetic algorithm based entity resolution approach with active learning
Chenchen SUN(),Derong SHEN,Yue KOU,Tiezheng NIE,Ge YU
Sohool of Information Science and Engineering, Northeastern University, Shenyang 110819, China
 Download: PDF(451 KB)  
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

Entity resolution is a key aspect in data quality and data integration, identifying which records correspond to the same real world entity in data sources. Many existing approaches require manually designed match rules to solve the problem, which always needs domain knowledge and is time consuming. We propose a novel genetic algorithm based entity resolution approach via active learning. It is able to learn effective match rules by logically combining several different attributes’ comparisons with proper thresholds. We use active learning to reduce manually labeled data and speed up the learning process. The extensive evaluation shows that the proposed approach outperforms the sate-of-the-art entity resolution approaches in accuracy.

Keywords entity resolution      genetic algorithm      active learning      data quality      data integration     
Corresponding Author(s): Chenchen SUN   
Just Accepted Date: 02 November 2015   Online First Date: 19 April 2016    Issue Date: 11 January 2017
 Cite this article:   
Chenchen SUN,Derong SHEN,Yue KOU, et al. A genetic algorithm based entity resolution approach with active learning[J]. Front. Comput. Sci., 2017, 11(1): 147-159.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-015-5276-6
https://academic.hep.com.cn/fcs/EN/Y2017/V11/I1/147
1 Chen J C, Chen Y G, Du X Y, Li C P, Lu J H, Zhao S Y, Zhou X. Big data challenge: a data management perspective. Frontiers of Computer Science, 2013, 7(2): 157–164
https://doi.org/10.1007/s11704-013-3903-7
2 Fellegi I P, Sunter A B. A theory for record linkage. Journal of the American Statistical Association, 1969, 64(328): 1183–1210
https://doi.org/10.1080/01621459.1969.10501049
3 Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1): 1–16
https://doi.org/10.1109/TKDE.2007.250581
4 Kopcke H, Rahm E. Frameworks for entity matching: a comparison. Data and Knowledge Engineering, 2010, 69(2): 197–210
https://doi.org/10.1016/j.datak.2009.10.003
5 Sarawagi S, Bhamidipaty A. Interactive deduplication using active learning. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2002, 269–278
https://doi.org/10.1145/775047.775087
6 Monge A E, Elkan C. The field matching problem: algorithms and applications. In: Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and DataMining. 1996, 267–270
7 Pinheiro J C, Sun D X. Methods for linking and mining massive heterogeneous databases. In: Proceedings of the 4th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1998, 309–313
8 Bilenko M, Mooney R J. Adaptive duplicate detection using learnable string similarity measures. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003, 39–48
https://doi.org/10.1145/956750.956759
9 Minton S N, Nanjo C, Knoblock C A, Michalowski M, Michelson M. A heterogeneous field matching method for record linkage. In: 5th IEEE International Conference on Data Mining. 2005, 8
https://doi.org/10.1109/ICDM.2005.7
10 Sun C, Shen D, Kou Y, Nie T, Yu G. ERGP: A Combined Entity Resolution Approach with Genetic Programming. In: Proceedings of the 11th IEEE Web Information System and Application Conference. 2014, 215–220
https://doi.org/10.1109/wisa.2014.46
11 Li P, Dong X L, Maurino A, Srivastava D. Linking temporal records. Frontiers of Computer Science, 2012, 6(3): 293–312
https://doi.org/10.1007/s11704-012-2002-5
12 Sun C C, Shen D R, Kou Y, Nie T Z, Yu G. GB-JER: A Graph-Based Model for Joint Entity Resolution. In: Proceedings of the 20th International Conference on Database Systems for Advanced Applications. 2015, 458–473
13 Tejada S, Knoblock C A, Minton S. Learning object identification rules for information integration. Information Systems, 2001, 26(8): 607–633
https://doi.org/10.1016/S0306-4379(01)00042-4
14 Winkler W E. Methods for record linkage and bayesian networks. Technical report, Series RRS2002/05. 2002
15 De Carvalho M G, Gonçalves M A, Laender A H F, Da Silva A S. Learning to deduplicate. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. 2006, 41–50
https://doi.org/10.1145/1141753.1141760
16 De Carvalho M G, Laender A H, Gonçalves M A, Da Silva A S. Replica identification using genetic programming. In: Proceedings of the 2008 ACM Symposium on Applied Computing. 2008, 1801–1806
https://doi.org/10.1145/1363686.1364118
17 De Carvalho M G, Laender A H, Gonçalves M A, Da Silva A S. A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(3): 399–412
https://doi.org/10.1109/TKDE.2010.234
18 Isele R, Bizer C. Learning linkage rules using genetic programming. In: Proceedings of the 6th International Workshop on Ontology Matching. 2011, 13–24
19 Isele R, Bizer C. Learning expressive linkage rules using genetic programming. The Very Large Databases Endowment, 2012, 5(11): 1638–1649
https://doi.org/10.14778/2350229.2350276
20 Banzhaf W, Nordin P, Keller R E, Francone F D. Genetic programming: an introduction. San Francisco: Morgan Kaufmann, 1998
https://doi.org/10.1007/BFb0055923
21 Poli R, Langdon W B, McPhee N F, Koza J R. A field guide to genetic programming. Lulu. com, 2008
22 Liere R, Tadepalli P. Active learning with committees for text categorization. In: Proceedings of the 14th National Conference on Artificial Intelligence. 1997, 591–596
23 Cohn D A, Atlas L, R Ladner. Improving generalization with active learning. Machine Learning, 1994, 15(2): 201–221
https://doi.org/10.1007/BF00993277
24 Bellare K, Suresh I, Parameswaran A, Rastogi V. Active sampling for entity matching with guarantees. ACM Transactions on Knowledge Discovery from Data, 2013, 7(3): 12
https://doi.org/10.1145/2513092.2500490
25 Arasu A, Gotz M, Kaushik R. On active learning of record matching packages. In: Proceedings of the 2010 International Conference on Management of Data. 2010, 783–794
https://doi.org/10.1145/1807167.1807252
26 Koza J R. Genetic Programming: on the Programming of Computers by Means of Natural Selection. Boston: MIT press, 1992
27 Cohen W, Ravikumar P, Fienberg S. A comparison of string metrics for matching names and records. In: Proceedings of the 9th ACM SIGKDDWorkshop on Data Cleaning and Object Consolidation. 2003, 73–78
28 Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York: ACM press, 1999
29 Blickle T, Thiele L. A Comparison of Selection Schemes Used in Genetic Algorithms. TIK-ReportNo.11, 1995
30 Monge A E, Elkan C P. An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of the 2nd ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery. 1997, 23–29
31 Shannon C E. A mathematical theory of communication. Bell System Technical Journal, 1948, 27(3): 379–423
https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
32 Hassanzadeh O, Chiang F, Lee H C, Miller R J. Framework for evaluating clustering algorithms in duplicate detection. The Very Large Databases Endowment, 2009, 2(1): 1282–1293
https://doi.org/10.14778/1687627.1687771
[1]  Supplementary Material Download
[2] Download
[1] Hao SHAO. Query by diverse committee in transfer active learning[J]. Front. Comput. Sci., 2019, 13(2): 280-291.
[2] Yan ZHANG, Hongzhi WANG, Long YANG, Jianzhong LI. Efficient histogram-based range query estimation for dirty data[J]. Front. Comput. Sci., 2018, 12(5): 984-999.
[3] Zhenxue HE, Limin XIAO, Fei GU, Tongsheng XIA, Shubin SU, Zhisheng HUO, Rong ZHANG, Longbing ZHANG, Li RUAN, Xiang WANG. An efficient and fast polarity optimization approach for mixed polarity Reed-Muller logic circuits[J]. Front. Comput. Sci., 2017, 11(4): 728-742.
[4] Sedigheh KHOSHNEVIS, Fereidoon SHAMS. Automating identification of services and their variability for product lines using NSGA-II[J]. Front. Comput. Sci., 2017, 11(3): 444-464.
[5] Jinyu CHEN, Shihua ZHANG. Integrative cancer genomics: models, algorithms and analysis[J]. Front. Comput. Sci., 2017, 11(3): 392-406.
[6] Lamia SADEG-BELKACEM,Zineb HABBAS,Wassila AGGOUNE-MTALAA. Adaptive genetic algorithms guided by decomposition for PCSPs: application to frequency assignment problems[J]. Front. Comput. Sci., 2016, 10(6): 1012-1025.
[7] Nengneng GAO,Sheng-Jun HUANG,Songcan CHEN. Multi-label active learning by model guided distribution matching[J]. Front. Comput. Sci., 2016, 10(5): 845-855.
[8] Jie XIN,Zhiming CUI,Pengpeng ZHAO,Tianxu HE. Active transfer learning of matching query results across multiple sources[J]. Front. Comput. Sci., 2015, 9(4): 595-607.
[9] Xiangxiang ZENG,Sisi YUAN,Xianxian HUANG,Quan ZOU. Identification of cytokine via an improved genetic algorithm[J]. Front. Comput. Sci., 2015, 9(4): 643-651.
[10] Priyanka CHAWLA,Inderveer CHANA,Ajay RANA. A novel strategy for automatic test data generation using soft computing technique[J]. Front. Comput. Sci., 2015, 9(3): 346-363.
[11] Dunwei GONG, Yan ZHANG. Generating test data for both path coverage and fault detection using genetic algorithms[J]. Front Comput Sci, 2013, 7(6): 822-837.
[12] Dion DETTERER, Paul KWAN, Cedric GONDRO. A co-evolving memetic wrapper for prediction of patient outcomes in TCM informatics[J]. Front Comput Sci, 2012, 6(5): 621-629.
[13] Bo YUAN, Wenhuang LIU. Measure oriented training: a targeted approach to imbalanced classification problems[J]. Front Comput Sci, 2012, 6(5): 489-497.
[14] Pei LI, Xin Luna DONG, Andrea MAURINO, Divesh SRIVASTAVA. Linking temporal records[J]. Front Comput Sci, 2012, 6(3): 293-312.
[15] Jaffer GARDEZI, Leopoldo BERTOSSI, Iluju KIRINGA. Matching dependencies: semantics and query answering[J]. Front Comput Sci, 2012, 6(3): 278-292.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed