Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2015, Vol. 9 Issue (6) : 919-933    https://doi.org/10.1007/s11704-015-4104-3
RESEARCH ARTICLE
RDF partitioning for scalable SPARQL query processing
Xiaoyan WANG1,2,3,Tao YANG1,Jinchuan CHEN2,*(),Long HE1,Xiaoyong DU1,2,4
1. School of Information, Renmin University of China, Beijing 100872, China
2. Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University, Beijing 100872, China
3. Information Center, Supreme People’s Court, Beijing 100745, China
4. State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
 Download: PDF(1014 KB)  
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

The volume of RDF data increases dramatically within recent years, while cloud computing platforms like Hadoop are supposed to be a good choice for processing queries over huge data sets for their wonderful scalability. Previous work on evaluating SPARQL queries with Hadoop mainly focus on reducing the number of joins through careful split of HDFS files and algorithms for generating Map/Reduce jobs. However, the way of partitioning RDF data could also affect system performance. Specifically, a good partitioning solution would greatly reduce or even totally avoid cross-node joins, and significantly cut down the cost in query evaluation. Based on HadoopDB, this work processes SPARQL queries in a hybrid architecture, where Map/Reduce takes charge of the computing tasks, and RDF query engines like RDF-3X store the data and execute join operations. According to the analysis of query workloads, this work proposes a novel algorithm for automatically partitioning RDF data and an approximate solution to physically place the partitions in order to reduce data redundancy. It also discusses how to make a good trade-off between query evaluation efficiency and data redundancy. All of these proposed approaches have been evaluated by extensive experiments over large RDF data sets.

Keywords RDF data      data partitioning      SPARQL query     
Corresponding Author(s): Jinchuan CHEN   
Just Accepted Date: 08 May 2015   Issue Date: 10 November 2015
 Cite this article:   
Xiaoyan WANG,Tao YANG,Xiaoyong DU, et al. RDF partitioning for scalable SPARQL query processing[J]. Front. Comput. Sci., 2015, 9(6): 919-933.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-015-4104-3
https://academic.hep.com.cn/fcs/EN/Y2015/V9/I6/919
1 Cui B, Mei H, Ooi B. Big data: the driver for innovation in databases. National Science Review, 2014, 1(1): 27−30
https://doi.org/10.1093/nsr/nwt020
2 Husain M, McGlothlin J, Masud M, Khan L, Thuraisingham B. Heuristics based query processing for large RDF graphs using cloud computing. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(9): 1312−1327
https://doi.org/10.1109/TKDE.2011.103
3 Myung J, Yeon J, Lee S. SPARQL basic graph pattern processing with iterative mapreduce. In: Proceedings of the 2010Workshop onMassive Data Analytics on the Cloud. 2010, 1−6
https://doi.org/10.1145/1779599.1779605
4 Agrawal S, Narasayya V, Yang B. Integrating vertical and horizontal partitioning into automated physical database design. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2004, 359−370
https://doi.org/10.1145/1007568.1007609
5 Pavlo A, Curino V, Zdonik S. Skew-aware automatic database partitioning in shared-nothing parallel OLTP systems. In: Proceedings of the ACM SIGMOD International Conference on management of Data. 2012, 61−72
https://doi.org/10.1145/2213836.2213844
6 Chang C, Kurc T, Sussman A, Catalyurek U, Saltz J. A hypergraphbased workload partitioning strategy for parallel data aggregation. In: Proceedings of the 11th SIAM Conference on Parallel Processing for Scientific Computing, 2001
7 Abouzeid A, Bajda-Pawlikowski K, Abadi D, Silberschatz A, Rasin A. HadoopDB: an architectural hybrid of mapreduce and DBMS technologies for analytical workloads. The Proceedings of the VLDB Endowment, 2009, 2(1): 992−933
https://doi.org/10.14778/1687627.1687731
8 Neumann T, Weikum G. RDF-3X: a risc-style engine for RDF. The Proceedings of the VLDB Endowment, 2008, 1(1): 647−659
https://doi.org/10.14778/1453856.1453927
9 Huang J, Abadi D, Ren K. Scalable SPARQL querying of large RDF graphs. The Proceedings of the VLDB Endowment, 2011, 4(11): 1123−1134
10 Andreev K, Räcke H. Balanced graph partitioning. In: Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures. 2004, 120−124
https://doi.org/10.1145/1007912.1007931
11 Guo Y B, Pan Z X, Heflin J. LUBM: a benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web, 2005, 3(2): 158−182
https://doi.org/10.1016/j.websem.2005.06.005
12 Yang T, Chen J, Wang X, Chen Y, Du X. Efficient SPARQL Query Evaluation via Automatic Data Partitioning. Technical Report. 2012
13 Sanghavi S, Shah D, Willsky A. Message passing for maximum weight independent set. IEEE Transactions on Information Theory, 2009, 55(11): 4822−4834
https://doi.org/10.1109/TIT.2009.2030448
14 Cormen T, Leiserson C, Rivest R, Stein C. Introduction to Algorithms. New York: The MIT Press, 2001
15 Getoor L, Taskar B, Koller D. Selectivity estimation using probabilistic models. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2001, 461−472
https://doi.org/10.1145/375663.375727
16 Wilkinson K, Sayers C, Kuno H, Reynolds D. Efficient RDF storage and retrieval in Jena2. In: Proceedings of the 2nd International Semantic Web Conference. 2003, 131−150
17 Du F, Chen Y, Du X. Partitioned indexes for entity search over RDF knowledge bases. In: Proceedings of the 17th International Conference on Database Systems for Advanced Applications. 2012, 141−155
https://doi.org/10.1007/978-3-642-29038-1_12
18 Górlitz O, Thimm M, Staab S. SPLODGE: Systematic generation of SPARQL benchmark queries for linked open data. In: Proceedings of International Semantic Web-ISWC. 2012, 116−132
19 Kim H, Ravindra P, Anyanwu K. Scan-sharing for optimizing RDF graph pattern matching on mapreduce. In: Proceedings of the 5th IEEE International Conference on Cloud Computing. 2012, 139−146
https://doi.org/10.1109/cloud.2012.14
20 Zeng K, Yang J, Wang H, Shao B, Wang Z. A distributed graph engine for web scale RDF data. In: Proceedings of the 39th International Conference on Very Large Data Bases. 2013, 265−276
https://doi.org/10.14778/2535570.2488333
21 Rao J, Zhang C, Megiddo N, Lohman G. Automating physical database design in a parallel database. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2002, 558−569
https://doi.org/10.1145/564691.564757
[1] Supplementary Material-Highlights in 3-page ppt
Download
[1] Li ZENG, Lei ZOU. Redesign of the gStore system[J]. Front. Comput. Sci., 2018, 12(4): 623-641.
[2] Guoliang CHEN, Rui MAO, Kezhong LU. A parallel computing framework for big data[J]. Front. Comput. Sci., 2017, 11(4): 608-621.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed