Active transfer learning of matching query results across multiple sources
Jie XIN1,2,*, Zhiming CUI1, Pengpeng ZHAO1, Tianxu HE1
1. The Institute of Intelligent Information Processing and Application, Soochow University, Suzhou 215006, China
2. Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215006, China

Abstract Entity resolution (ER) is the problem of identifying and grouping different manifestations of the same real-world object. Many algorithmic approaches have been developed, and most achieve their best performance under supervised learning. However, the prohibitive cost of labeling training data remains a major obstacle to detecting duplicate query records from online sources. Furthermore, noisy data combined with missing elements makes ER tasks more challenging. To address this, transfer learning has been adopted to adaptively share the learned common structure of similarity scoring problems across multiple sources. Although such techniques reduce the labeling cost so that it is linear in the number of sources, their random sampling strategy does not adequately handle the common sample imbalance problem. In this paper, we present a novel multi-source active transfer learning framework that jointly selects fewer data instances from all sources to train classifiers with constant precision/recall. The intuition behind our approach is to actively label the most informative samples while adaptively transferring collective knowledge between sources. In this way, the learned classifiers can be both label-economical and flexible, even for imbalanced or quality-diverse sources. We compare our method with state-of-the-art approaches on real-world datasets. Our experimental results demonstrate that our active transfer learning algorithm achieves impressive performance with far fewer labeled samples for record matching across numerous and varied sources.
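
The idea of jointly selecting the most informative samples across sources can be illustrated with a minimal sketch. It is not the method of the paper: the abstract does not state the selection criterion or the optimization used, so the sketch assumes plain uncertainty sampling (query the record pair whose predicted match probability is closest to 0.5) and a linear scorer whose weights are the sum of a shared part and a source-specific part. All names, weights, and feature vectors below are hypothetical.

```python
import math


def match_probability(features, shared_w, specific_w):
    """Logistic match score of a record pair.

    The weight vector is assumed to be shared_w + specific_w, i.e. a part
    common to all sources plus a per-source correction (an assumption).
    """
    z = sum(x * (s + d) for x, s, d in zip(features, shared_w, specific_w))
    return 1.0 / (1.0 + math.exp(-z))


def select_most_uncertain(pools, shared_w, specific_ws):
    """Pick one unlabeled pair to query, searching all sources jointly.

    pools:       dict mapping source name -> list of similarity-feature vectors
    specific_ws: dict mapping source name -> source-specific weight vector
    Returns (source name, index) of the pair with probability closest to 0.5.
    """
    best, best_margin = None, float("inf")
    for source, candidates in pools.items():
        for i, features in enumerate(candidates):
            p = match_probability(features, shared_w, specific_ws[source])
            margin = abs(p - 0.5)
            if margin < best_margin:
                best, best_margin = (source, i), margin
    return best


# Hypothetical toy data: two sources, two candidate record pairs each.
shared_w = [2.0, -1.0]
specific_ws = {"source_a": [0.5, 0.0], "source_b": [-0.5, 0.5]}
pools = {
    "source_a": [[0.9, 0.1], [0.1, 0.8]],
    "source_b": [[0.25, 0.75], [0.8, 0.2]],
}

query = select_most_uncertain(pools, shared_w, specific_ws)
print(query)  # the pair that would be sent to a human labeler next
```

In a full active learning loop, the queried pair would be labeled, added to the training set, and the shared and source-specific weights refit before the next query is chosen.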

Keywords: entity resolution, active learning, transfer learning, convex optimization

Corresponding Author(s): Jie XIN
Just Accepted Date: 31 December 2014
Issue Date: 07 September 2015