EnAli: entity alignment across multiple heterogeneous data sources
Chao KONG1, Ming GAO1, Chen XU2, Yunbin FU1, Weining QIAN1, Aoying ZHOU1
1. School of Data Science and Engineering, East China Normal University, Shanghai 200062, China
2. Technische Universität Berlin, Berlin 10623, Germany
Abstract Entity alignment is the problem of identifying which entities in one data source refer to the same real-world entity as entities in other data sources. Identifying entities across heterogeneous data sources is essential to many research fields, such as data cleaning, data integration, information retrieval, and machine learning. The alignment process is not only prohibitively expensive for large data sources, since it involves all tuples from two or more data sources, but must also handle heterogeneous entity attributes. In this paper, we propose an unsupervised approach, called EnAli, to match entities across two or more heterogeneous data sources. EnAli employs a generative probabilistic model that incorporates heterogeneous entity attributes through the exponential family and handles missing values, and it uses a locality-sensitive hashing scheme to reduce the number of candidate tuples and speed up the alignment process. EnAli is highly accurate and efficient even without any ground-truth tuples. We evaluate EnAli on re-identifying entities within the same data source, as well as on aligning entities across three real data sources. Our experimental results show that the proposed approach outperforms the comparable baseline.
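To illustrate how locality-sensitive hashing can shrink the set of candidate tuples before any pairwise comparison, the following is a minimal sketch using MinHash with banding over character 3-grams. The abstract does not state which hash family EnAli uses, so MinHash, the shingle length, and the function names (`shingles`, `minhash_signature`, `candidate_pairs`) are all assumptions made for illustration, not the authors' implementation.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of character k-grams of a lower-cased string."""
    text = text.lower()
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def candidate_pairs(records, num_hashes=32, bands=16):
    """Return pairs of record ids that share at least one LSH band bucket.

    Assumed setup: `records` maps a record id to a single string attribute.
    Only records that collide in some band are kept as candidates.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    "a": "East China Normal University",
    "b": "East China Normal University",
    "c": "zzzz qqqq",
}
pairs = candidate_pairs(records)
```

Records with identical attribute values always land in the same buckets, while records with no shared 3-grams are filtered out except in the case of a hash collision. Only the surviving pairs would then be scored by a matching model.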
Keywords: entity alignment, exponential family, locality sensitive hashing, EM-algorithm
Corresponding Author: Ming GAO
Just Accepted Date: 29 September 2017
Online First Date: 13 June 2018
Issue Date: 31 January 2019