1. DEKE Lab, Renmin University of China, Ministry of Education, Beijing 100872, China 2. School of Information, Renmin University of China, Beijing 100872, China 3. EMC Labs China, Beijing 100084, China 4. School of Computing, National University of Singapore, Singapore 117417, Singapore
MapReduce is a popular framework for largescale data analysis. As data access is critical forMapReduce’s performance, some recent work has applied different storage models, such as column-store or PAX-store, to MapReduce platforms. However, the data access patterns of different queries are very different. No storagemodel is able to achieve the optimal performance alone. In this paper, we study how MapReduce can benefit from the presence of two different column-store models — pure column-store and PAX-store. We propose a hybrid storage system called hybrid columnstore (HC-store). Based on the characteristics of the incoming MapReduce tasks, our storage model can determine whether to access the underlying pure column-store or PAX-store.We studied the properties of the different storage models and create a cost model to decide the data access strategy at runtime. We have implemented HC-store on top of Hadoop. Our experimental results show that HC-store is able to outperform PAX-store and column-store, especially when confronted with diverse workload.
Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating Systems and Implementation. 2004, 137-150
2
Floratou A, Patel J M, Shekita E J, Tata S. Column-oriented storage techniques for mapreduce. In: Proceedings of the 37th International Conference on Very Large Data Bases. 2011, 4(7): 419-429
3
He Y, Lee R, Huai Y, Shao Z, Jain N, Zhang X, Xu Z. RCFile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems. In: Proceedings of the IEEE 27th International Conference on Data Engineering. 2011, 1199-1208
4
Copeland G P, Khoshafian S N. A decomposition storage model. In: Proceedings of the 1985 ACM SIGMOD International Conference on Management of Data. 1985, 268-279
https://doi.org/10.1145/318898.318923
5
Abadi D J, Madden S, Hachem N. Column-stores vs. row-stores: how different are they really? In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. 2008, 967-980
https://doi.org/10.1145/1376616.1376712
6
Stonebraker M, Abadi D J, Batkin A, Chen X, Cherniack M, Ferreira M, Lau E, Lin A, Madden S, O’Neil E J, O’Neil P E, Rasin A, Tran N, Zdonik S B. C-store: A column-oriented dbms. In: Proceedings of the 31st International Conference on Very Large Data Bases. 2005, 553-564
7
Pavlo A, Paulson E, Rasin A, Abadi D J, DeWitt D J, Madden S, Stonebraker M. A comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. 2009, 165-178
8
Chen S. Cheetah: A high performance, custom data warehouse on top of mapreduce. Proceedings of the Very Large Data Bases Endowment, 2010, 3(2): 1459-1468
9
Lin Y, Agrawal D, Chen C, Ooi B C, Wu S. Llama: leveraging columnar storage for scalable join processing in the mapreduce framework. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. 2011, 961-972
https://doi.org/10.1145/1989323.1989424
10
Jindal A, Quiané-Ruiz J A, Dittrich J. Trojan data layouts: right shoes for a running elephant. In: Proceedings of the 2nd ACM Symposium on Cloud Computing. 2011, 21
https://doi.org/10.1145/2038916.2038937
11
Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig Latin: a notso-foreign language for data processing. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. 2008, 1099-1110
https://doi.org/10.1145/1376616.1376726
Ramamurthy R, DeWitt D J, Su Q. A case for fractured mirrors. The International Journal on Very Large Data Bases, 2003, 12(2): 89-101
https://doi.org/10.1007/s00778-003-0093-1