Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2015, Vol. 9 Issue (4) : 595-607    https://doi.org/10.1007/s11704-015-4068-3
RESEARCH ARTICLE
Active transfer learning of matching query results across multiple sources
Jie XIN1,2,*, Zhiming CUI1, Pengpeng ZHAO1, Tianxu HE1
1. The Institute of Intelligent Information Processing and Application, Soochow University, Suzhou 215006, China
2. Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215006, China
Abstract

Entity resolution (ER) is the problem of identifying and grouping different manifestations of the same real-world object. Many algorithmic approaches have been developed, and for most tasks they perform best under supervised learning. However, the prohibitive cost of labeling training data remains a major obstacle to detecting duplicate query records from online sources. Furthermore, the unique combinations of noisy data and missing elements make ER tasks more challenging. To address this, transfer learning has been adopted to adaptively share learned common structures of similarity scoring problems between multiple sources. Although such techniques reduce the labeling cost so that it is linear in the number of sources, their random sampling strategy does not handle the common sample imbalance problem well. In this paper, we present a novel multi-source active transfer learning framework that jointly selects fewer data instances from all sources to train classifiers with constant precision/recall. The intuition behind our approach is to actively label the most informative samples while adaptively transferring collective knowledge between sources. In this way, the learned classifiers can be both label-economical and flexible, even for imbalanced or quality-diverse sources. We compare our method with state-of-the-art approaches on real-world datasets. Our experimental results demonstrate that our active transfer learning algorithm achieves impressive performance with far fewer labeled samples for record matching with numerous and varied sources.
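The intuition of labeling the most informative samples across sources can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a plain uncertainty-sampling rule (pick the candidate pair whose predicted match probability is closest to 0.5) and a hypothetical shared-plus-source-specific weight decomposition standing in for transferred knowledge. All function names, weights, and feature vectors below are illustrative assumptions.

```python
import math

def match_probability(features, shared_w, source_w):
    """Logistic match score; weights are shared weights plus a per-source offset (assumed model)."""
    z = sum(f * (a + b) for f, a, b in zip(features, shared_w, source_w))
    return 1.0 / (1.0 + math.exp(-z))

def pick_most_uncertain(pools, shared_w, source_ws):
    """Return (source, index) of the unlabeled pair whose score is closest to 0.5."""
    best, best_margin = None, float("inf")
    for source, pairs in pools.items():
        for i, features in enumerate(pairs):
            p = match_probability(features, shared_w, source_ws[source])
            margin = abs(p - 0.5)
            if margin < best_margin:
                best, best_margin = (source, i), margin
    return best

# Toy data (hypothetical): two sources, each with two candidate record pairs
# described by two similarity features.
shared_w = [2.0, -1.0]
source_ws = {"A": [0.0, 0.0], "B": [0.5, 0.0]}
pools = {
    "A": [[1.0, 0.0], [0.5, 1.0]],
    "B": [[0.9, 0.1], [0.1, 0.9]],
}
choice = pick_most_uncertain(pools, shared_w, source_ws)
print(choice)
```

In a full active learning loop, the selected pair would be sent to a human labeler, the weights refit, and the selection repeated until the labeling budget is spent. Selecting jointly over all pools, rather than sampling each source at random, is what lets the budget concentrate on ambiguous pairs.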

Keywords: entity resolution, active learning, transfer learning, convex optimization
Corresponding Author(s): Jie XIN   
Just Accepted Date: 31 December 2014   Issue Date: 07 September 2015
 Cite this article:   
Jie XIN,Zhiming CUI,Pengpeng ZHAO, et al. Active transfer learning of matching query results across multiple sources[J]. Front. Comput. Sci., 2015, 9(4): 595-607.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-015-4068-3
https://academic.hep.com.cn/fcs/EN/Y2015/V9/I4/595