HC-Store: putting MapReduce’s foot in two camps
Huiju WANG1,2,4,*, Furong LI4, Xuan ZHOU1, Yu CAO3, Xiongpai QIN1,2, Jidong CHEN3, Shan WANG1,2
1. DEKE Lab, Renmin University of China, Ministry of Education, Beijing 100872, China 2. School of Information, Renmin University of China, Beijing 100872, China 3. EMC Labs China, Beijing 100084, China 4. School of Computing, National University of Singapore, Singapore 117417, Singapore
Abstract MapReduce is a popular framework for large-scale data analysis. Because data access is critical to MapReduce’s performance, recent work has applied different storage models, such as column-store and PAX-store, to MapReduce platforms. However, data access patterns vary widely across queries, and no single storage model achieves optimal performance for all of them. In this paper, we study how MapReduce can benefit from the presence of two different column-store models: pure column-store and PAX-store. We propose a hybrid storage system called hybrid column-store (HC-store). Based on the characteristics of incoming MapReduce tasks, HC-store determines whether to access the underlying pure column-store or PAX-store. We study the properties of the two storage models and build a cost model that decides the data access strategy at runtime. We have implemented HC-store on top of Hadoop. Our experimental results show that HC-store outperforms both PAX-store and column-store, especially under diverse workloads.
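The idea of choosing a store per task from a cost estimate can be illustrated with a minimal sketch. This is not the cost model of the paper, whose details are not given in the abstract. The names `TaskProfile`, `column_store_cost`, `pax_store_cost` and `choose_store`, and the parameters `reconstruct_cost_per_value` and `skip_efficiency`, are hypothetical. The sketch assumes only that a pure column-store reads just the requested columns but pays to stitch values back into tuples, while a PAX-store avoids that stitching but cannot skip all bytes of unneeded columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical summary of what a MapReduce task will read."""
    columns_accessed: int      # columns the task actually needs
    total_columns: int         # columns in the table
    rows: int                  # rows scanned
    bytes_per_value: int = 8   # assumed fixed-width values


def column_store_cost(task: TaskProfile,
                      reconstruct_cost_per_value: float = 2.0) -> float:
    """Assumed cost: read only the needed columns, then pay a per-value
    penalty to reassemble tuples from the separate columns."""
    io = task.rows * task.columns_accessed * task.bytes_per_value
    stitch = task.rows * (task.columns_accessed - 1) * reconstruct_cost_per_value
    return io + stitch


def pax_store_cost(task: TaskProfile, skip_efficiency: float = 0.5) -> float:
    """Assumed cost: no tuple reconstruction, but only a fraction
    (skip_efficiency) of the unneeded columns' bytes can be skipped."""
    needed = task.rows * task.columns_accessed * task.bytes_per_value
    unneeded = (task.rows * (task.total_columns - task.columns_accessed)
                * task.bytes_per_value)
    return needed + (1.0 - skip_efficiency) * unneeded


def choose_store(task: TaskProfile) -> str:
    """Pick the store with the lower estimated cost for this task."""
    if column_store_cost(task) <= pax_store_cost(task):
        return "column-store"
    return "PAX-store"


narrow = TaskProfile(columns_accessed=1, total_columns=20, rows=1000)
wide = TaskProfile(columns_accessed=18, total_columns=20, rows=1000)
print(choose_store(narrow))
print(choose_store(wide))
```

Under these assumed parameters, a task touching one of twenty columns is routed to the column-store and a task touching eighteen of twenty is routed to the PAX-store. A real model would have to calibrate its parameters against measured I/O and reconstruction costs on the target cluster.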
Keywords
MapReduce
Hadoop
HC-store
cost model
column-store
PAX-store
Corresponding Author(s):
Huiju WANG
Issue Date: 27 November 2014