Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2019, Vol. 13 Issue (1) : 157-169    https://doi.org/10.1007/s11704-017-6561-3
RESEARCH ARTICLE
EnAli: entity alignment across multiple heterogeneous data sources
Chao KONG1, Ming GAO1, Chen XU2, Yunbin FU1, Weining QIAN1, Aoying ZHOU1
1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
2. Technische Universität Berlin, Berlin 10623, Germany
Abstract

Entity alignment is the problem of identifying which entities in one data source refer to the same real-world entity as entities in other data sources. Identifying entities across heterogeneous data sources is essential to many research fields, such as data cleaning, data integration, information retrieval, and machine learning. The alignment process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but must also handle heterogeneous entity attributes. In this paper, we propose an unsupervised approach, called EnAli, to match entities across two or more heterogeneous data sources. EnAli employs a generative probabilistic model that incorporates heterogeneous entity attributes through the exponential family and handles missing values, and it uses a locality-sensitive hashing scheme to reduce the number of candidate tuples and speed up the alignment process. EnAli is highly accurate and efficient even without any ground-truth tuples. We evaluate EnAli on re-identifying entities within the same data source, as well as on aligning entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
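
To make the candidate-reduction idea concrete, the following is a minimal sketch of locality sensitive hashing used as a blocking step. It is not the EnAli implementation: the abstract does not say which hash family the authors use, so MinHash over character 3-grams with banding is assumed here as one common choice. The function names (shingles, minhash_signature, candidate_pairs), the parameters (num_hashes, bands), and the toy records are all hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items, num_hashes=32):
    """MinHash signature: for each seeded hash, keep the minimum over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def candidate_pairs(records, num_hashes=32, bands=16):
    """Records that agree on at least one signature band become candidates.

    Assumption: records maps a record id to a single string attribute.
    Only pairs that share a bucket are returned, so most of the
    quadratic pair space is never compared.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy usage: two sources hold the same name, a third record is unrelated.
records = {"a1": "Weining Qian", "b1": "Weining Qian", "c1": "zzzzqqqq xxxx"}
pairs = candidate_pairs(records)
```

In a pipeline of the kind the abstract describes, only the pairs returned by such a blocking step would be passed to the probabilistic matching model. More bands with fewer rows each admit more candidates (higher recall, more comparisons), while fewer bands with more rows admit fewer.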

Keywords: entity alignment, exponential family, locality sensitive hashing, EM-algorithm
Corresponding Author(s): Ming GAO   
Just Accepted Date: 29 September 2017   Online First Date: 13 June 2018    Issue Date: 31 January 2019
 Cite this article:   
Chao KONG,Ming GAO,Chen XU, et al. EnAli: entity alignment across multiple heterogeneous data sources[J]. Front. Comput. Sci., 2019, 13(1): 157-169.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-017-6561-3
https://academic.hep.com.cn/fcs/EN/Y2019/V13/I1/157