Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2014, Vol. 8 Issue (6) : 859-871    https://doi.org/10.1007/s11704-014-3376-3
RESEARCH ARTICLE
HC-Store: putting MapReduce’s foot in two camps
Huiju WANG1,2,4,*, Furong LI4, Xuan ZHOU1, Yu CAO3, Xiongpai QIN1,2, Jidong CHEN3, Shan WANG1,2
1. DEKE Lab, Renmin University of China, Ministry of Education, Beijing 100872, China
2. School of Information, Renmin University of China, Beijing 100872, China
3. EMC Labs China, Beijing 100084, China
4. School of Computing, National University of Singapore, Singapore 117417, Singapore
Abstract

MapReduce is a popular framework for large-scale data analysis. As data access is critical for MapReduce’s performance, some recent work has applied different storage models, such as column-store or PAX-store, to MapReduce platforms. However, the data access patterns of different queries vary widely, and no single storage model achieves optimal performance on its own. In this paper, we study how MapReduce can benefit from the presence of two different column-store models: pure column-store and PAX-store. We propose a hybrid storage system called hybrid column-store (HC-store). Based on the characteristics of the incoming MapReduce tasks, our storage model can determine whether to access the underlying pure column-store or PAX-store. We studied the properties of the different storage models and created a cost model to decide the data access strategy at runtime. We have implemented HC-store on top of Hadoop. Our experimental results show that HC-store is able to outperform PAX-store and column-store, especially when confronted with diverse workloads.
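The idea of choosing a store per task with a cost model can be illustrated with a small sketch. This is not the cost model from the paper, which the abstract does not spell out. The `TaskProfile` fields, the two cost functions, and all numeric weights are hypothetical, chosen only so that each store wins for some task profile.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical summary of an incoming MapReduce task."""
    columns_read: int    # number of columns the task touches
    selectivity: float   # fraction of rows that survive filtering (0..1)


def column_store_cost(task: TaskProfile,
                      scan: float = 1.0,
                      reconstruct: float = 4.0) -> float:
    # Assumed weights: cheap per-column scan, expensive tuple reconstruction.
    return (task.columns_read * scan
            + task.selectivity * task.columns_read * reconstruct)


def pax_store_cost(task: TaskProfile,
                   scan: float = 1.5,
                   reconstruct: float = 0.5) -> float:
    # Assumed weights: costlier scan, cheap tuple reconstruction.
    return (task.columns_read * scan
            + task.selectivity * task.columns_read * reconstruct)


def choose_store(task: TaskProfile) -> str:
    """Return the name of the store with the lower estimated cost."""
    costs = {
        "column-store": column_store_cost(task),
        "pax-store": pax_store_cost(task),
    }
    return min(costs, key=costs.get)


if __name__ == "__main__":
    print(choose_store(TaskProfile(columns_read=2, selectivity=0.01)))
    print(choose_store(TaskProfile(columns_read=8, selectivity=0.9)))
```

With these made-up weights, a narrow and highly selective task is routed to the column-store and a wide, low-selectivity task to the PAX-store. A real system would calibrate the cost terms against measured I/O and reconstruction overheads.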

Keywords: MapReduce, Hadoop, HC-store, cost model, column-store, PAX-store
Corresponding Author(s): Huiju WANG   
Issue Date: 27 November 2014
 Cite this article:   
Huiju WANG,Furong LI,Xuan ZHOU, et al. HC-Store: putting MapReduce’s foot in two camps[J]. Front. Comput. Sci., 2014, 8(6): 859-871.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-014-3376-3
https://academic.hep.com.cn/fcs/EN/Y2014/V8/I6/859