Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2016, Vol. 10 Issue (2) : 317-329    https://doi.org/10.1007/s11704-015-4505-3
RESEARCH ARTICLE
Efficient graph similarity join for information integration on graphs
Yue WANG,Hongzhi WANG(),Jianzhong LI,Hong GAO
Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
 Download: PDF(553 KB)  
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

Graphs have been widely used for complex data representation in many real applications, such as social network, bioinformatics, and computer vision. Therefore, graph similarity join has become imperative for integrating noisy and inconsistent data from multiple data sources. The edit distance is commonly used to measure the similarity between graphs. The graph similarity join problem studied in this paper is based on graph edit distance constraints. To accelerate the similarity join based on graph edit distance, in the paper, we make use of a preprocessing strategy to remove the mismatching graph pairs with significant differences. Then a novel method of building indexes for each graph is proposed by grouping the nodes which can be reached in k hops for each key node with structure conservation, which is the k-hop tree based indexing method. As for each candidate pair, we propose a similarity computation algorithm with boundary filtering, which can be applied with good efficiency and effectiveness. Experiments on real and synthetic graph databases also confirm that our method can achieve good join quality in graph similarity join. Besides, the join process can be finished in polynomial time.

Keywords graph similarity join      edit distance constraint      khop tree based indexing      structure conservation      boundary filtering     
Corresponding Author(s): Hongzhi WANG   
Just Accepted Date: 21 September 2015   Issue Date: 16 March 2016
 Cite this article:   
Yue WANG,Hongzhi WANG,Jianzhong LI, et al. Efficient graph similarity join for information integration on graphs[J]. Front. Comput. Sci., 2016, 10(2): 317-329.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-015-4505-3
https://academic.hep.com.cn/fcs/EN/Y2016/V10/I2/317
1 Zhao X, Xiao C, Lin X, Wang W. Efficient graph similarity joins with edit distance constraints. In: Proceedings of the 28th IEEE International Conference on Data Engineer. 2012, 834–845
https://doi.org/10.1109/icde.2012.91
2 Qin J, Wang W, Lu Y, Xiao C, Lin X. Efficient exact edit similarity query processing with the asymmetric signature schemes. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. 2011, 1033–1044
https://doi.org/10.1145/1989323.1989431
3 Fan W, Li J, Ma S, Tang N, Wu Y. Graph pattern matching: from intractable to polynomial time. Proceedings of the VLDB Endowment, 2011, 3(1): 264–275
4 Ma S, Cao Y, Fan W, Huai J, Wo T. Capturing topology in graph pattern matching. Proceedings of the VLDB Endowment, 2011, 5(4): 310–321
https://doi.org/10.14778/2095686.2095690
5 Sanfeliu A, Fu K S. A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, 1983, 13(3): 353–362
https://doi.org/10.1109/TSMC.1983.6313167
6 Bunke H, Allermann G. Inexact graph matching for structural pattern recognition. Pattern Recognition Letters, 1983, 1(4): 245–253
https://doi.org/10.1016/0167-8655(83)90033-8
7 Gouda K, Arafa M. An improved global lower bound for graph edit similarity search. Pattern Recognition Letters, 2015 58: 8–14
https://doi.org/10.1016/j.patrec.2015.02.004
8 Ibragimov R. Exact and heuristic algorithms for network alignment using graph edit distance models. Dissertation for the Doctoral Degree. Fachrichtung 6.2 – Informatik, 2015
9 Baumbach J, Guo J, Ibragimov R. Multiple graph edit distance: simultaneous topological alignment of multiple protein-protein interaction networks with an evolutionary algorithm. In: Proceedings of the 2014 Conference on Genetic and Evolutionary Computation. 2014: 277–284
10 Justice D, Hero A. A binary linear programming formulation of the graph edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(8): 1200–1214
https://doi.org/10.1109/TPAMI.2006.152
11 Fankhauser S, Riesen K, Bunke H. Speeding up graph edit distance computation through fast bipartite matching. In: Proceedings of the 8th International Workshop on Graph-Based Representations in Pattern Recognition. 2011, 102–111
https://doi.org/10.1007/978-3-642-20844-7_11
12 Wang G, Wang B, Yang X, G. Yu G. Efficiently indexing large sparse graphs for similarity search. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(3): 440–451
https://doi.org/10.1109/TKDE.2010.28
13 Wang Y, Wang H, Li J, Gao H. Graph similarity join with k-hop tree indexing. In: Proceedings of the International Conference of Young Computer Scientists, Engineers and Educators. 2015, 38–47
https://doi.org/10.1007/978-3-662-46248-5_6
14 Zaki M J. Efficiently mining frequent trees in a forest: algorithms and applications. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(8): 1021–1035
https://doi.org/10.1109/TKDE.2005.125
15 Gao X, Xiao B, Tao D, Li X. A survey of graph edit distance. Pattern Analysis and Applications, 2010, 13(1): 113–129
https://doi.org/10.1007/s10044-008-0141-y
16 Conte D, Ramel JY, Sidère N, Luqman MM, Gaüzère B, Gibert J, Brun L, Vento M. A comparison of explicit and implicit graph embedding methods for pattern recognition. In: Proceedings of the 9th International Workshop on Graph-Based Representations in Pattern Recognition. 2013, 81–90
https://doi.org/10.1007/978-3-642-38221-5_9
17 Shao Y, Cui B, Chen L, Liu M, Xie X. An efficient similarity search framework for SimRank over large dynamic graphs. Proceedings of the VLDB Endowment, 2015, 8(8): 838–849
https://doi.org/10.14778/2757807.2757809
18 Shao Y, Cui M, Ma L. PAGE: a partition aware engine for parallel graph computation. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(2): 518–530
https://doi.org/10.1109/TKDE.2014.2327037
19 Xu N, Chen L, Cui B. LogGP: a log-based dynamic graph partitioning method. Proceedings of the VLDB Endowment, 2014, 7(14): 1917–1928
https://doi.org/10.14778/2733085.2733097
20 Shao Y, Chen L, Cui B. Efficient cohesive subgraphs detection in parallel. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 2014, 613–624
https://doi.org/10.1145/2588555.2593665
21 Shao Y, Cui B, Chen L, Ma L, Yao J, Xu N. Parallel subgraph listing in a large-scale graph. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 2014, 625–636
https://doi.org/10.1145/2588555.2588557
22 Shao Y, Yao J, Cui B, Ma L. PAGE: a partition aware graph computation engine. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management. 2013, 823–828
https://doi.org/10.1145/2505515.2505617
23 Cui B, Mei H, Ooi B C. Big data: the driver for innovation in databases. National Science Review, 2014, 1(1): 27–30
https://doi.org/10.1093/nsr/nwt020
24 Shang H, Lin X, Zhang Y, Yu J X, Wang W. Connected substructure similarity search. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. 2010, 903–914
https://doi.org/10.1145/1807167.1807264
25 Yan X, Yu P S, Han J. Substructure similarity search in graph databases. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. 2005, 766–777
https://doi.org/10.1145/1066157.1066244
26 Zhu Y, Qin L, Yu J X, Ke Y, Lin X. High efficiency and quality: large graphs matching. The VLDB Journal — The International Journal on Very Large Data Bases, 2013, 22(3): 345–368
27 Williams D W, Huan J, Wang W. Graph database indexing using structured graph decomposition. In: Proceedings of the 23rd IEEE International Conference on Data Engineering. 2007, 976–985
https://doi.org/10.1109/icde.2007.368956
28 Zou L, Chen L, Özsu M T. Distance-join: pattern match query in a large graph databases. Proceedings of the VLDB Endowment, 2009, 2(1): 886–897
https://doi.org/10.14778/1687627.1687727
29 Zeng Z, Tung A K, Wang J, Feng J, Zhou L. Comparing stars: on approximating graph edit distance. Proceedings of the VLDB Endowment, 2009, 2(1): 25–36
https://doi.org/10.14778/1687627.1687631
30 Zheng W, Zou L, Feng Y, Chen L, Zhao D. Efficient SimRank-based similarity join over large graphs. Proceedings of the VLDB Endowment, 2013, 6(7): 493–504
https://doi.org/10.14778/2536349.2536350
[1] FCS-0317-14505-HZW_suppl_1 Download
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed