Reducing partition skew on MapReduce: an incremental allocation approach

doi:10.1007/s11704-018-6586-2

Front. Comput. Sci.

2019, Vol. 13

Issue (5) : 960-975 https://doi.org/10.1007/s11704-018-6586-2

RESEARCH ARTICLE

Reducing partition skew on MapReduce: an incremental allocation approach

Zhuo WANG, Qun CHEN(

), Bo SUO, Wei PAN, Zhanhuai LI

School of Computer Science and Engineering, NorthWestern Polytechnical University, Xi’an 710072, China

Download: PDF(1170 KB)
Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks

Abstract

MapReduce, a parallel computational model, has been widely used in processing big data in a distributed cluster. Consisting of alternate map and reduce phases, MapReduce has to shuffle the intermediate data generated by mappers to reducers. The key challenge of ensuring balanced workload on MapReduce is to reduce partition skew among reducers without detailed distribution information on mapped data.

In this paper, we propose an incremental data allocation approach to reduce partition skew among reducers on MapReduce. The proposed approach divides mapped data into many micro-partitions and gradually gathers the statistics on their sizes in the process of mapping. The micropartitions are then incrementally allocated to reducers in multiple rounds. We propose to execute incremental allocation in two steps, micro-partition scheduling and micro-partition allocation. We propose a Markov decision process (MDP) model to optimize the problem of multiple-round micropartition scheduling for allocation commitment. We present an optimal solution with the time complexity of O(K · N²), in which K represents the number of allocation rounds and N represents the number of micro-partitions. Alternatively, we also present a greedy but more efficient algorithm with the time complexity of O(K · N ln N). Then, we propose a minmax programming model to handle the allocation mapping between micro-partitions and reducers, and present an effective heuristic solution due to its NP-completeness. Finally, we have implemented the proposed approach on Hadoop, an open-source MapReduce platform, and empirically evaluated its performance. Our extensive experiments show that compared with the state-of-the-art approaches, the proposed approach achieves considerably better data load balance among reducers as well as overall better parallel performance.

Keywords incremental partitioning data balance MapReduce

Corresponding Author(s): Qun CHEN

Just Accepted Date: 20 December 2017 Online First Date: 07 January 2019 Issue Date: 25 June 2019

Cite this article:

Zhuo WANG,Qun CHEN,Bo SUO, et al. Reducing partition skew on MapReduce: an incremental allocation approach[J]. Front. Comput. Sci., 2019, 13(5): 960-975.

URL:

https://academic.hep.com.cn/fcs/EN/10.1007/s11704-018-6586-2
https://academic.hep.com.cn/fcs/EN/Y2019/V13/I5/960

1	J Dean, S Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107–113 https://doi.org/10.1145/1327452.1327492
2	F Li, B C Ooi, M T Özsu, S Wu. Distributed data management using mapreduce. ACM Computing Surveys (CSUR), 2014, 46(3): 31 https://doi.org/10.1145/2503009
3	H Apache. Hadoop, 2009
4	J Lin. The curse of zipf and limits to parallelization: a look at the stragglers problem in mapreduce. In: Proceedings of the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval. 2009, 57–62
5	K Ren, G Gibson, Y C Kwon, M Balazinska, B Howe. Hadoop’s adolescence; a comparative workloads analysis from three research clusters. In: Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. 2012, 1452
6	S C Racha. Load balancing map-reduce communications for efficient executions of applications in a cloud. Project Report, 2012
7	L Kolb, A Thor, E Rahm. Block-based load balancing for entity resolution with mapreduce. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. 2011, 2397–2400 https://doi.org/10.1145/2063576.2063976
8	L Kolb, A Thor, E Rahm. Load balancing for mapreduce-based entity resolution. In: Proceedings of the 28th IEEE International Conference on Data Engineering. 2012, 618–629 https://doi.org/10.1109/ICDE.2012.22
9	B Gufler, N Augsten, A Reiser, A Kemper. Handing data skew in mapreduce. In: Proceedings of the 1st International Conference on Cloud Computing and Services Science. 2011, 574–583
10	B Gufler, N Augsten, A Reiser, A Kemper. Load balancing in mapreduce based on scalable cardinality estimates. In: Proceedings of the 28th IEEE International Conference on Data Engineering. 2012, 522–533 https://doi.org/10.1109/ICDE.2012.58
11	Q Chen, J Yao, Z Xiao. Libra: lightweight data skew mitigation in mapreduce. IEEE Transactions on Parallel and Distributed System, 2015, 26(9): 2520–2533 https://doi.org/10.1109/TPDS.2014.2350972
12	D DeWitt, M Stonebraker. Mapreduce: a major step backwards. The Database Column, 2008, 1: 23
13	Y C Kwon, M Balazinska, B Howe, J Rolia. A study of skew in mapreduce applications. Open Cirrus Summit, 2011, 11
14	A Rasmussen, M Conley, R Kapoor, U T Lam, G Porter, A Vahdat. Themis: an I/O-efficient MapReduce. In: Proceedings of the 3rd ACM Symposium on Cloud Computing. 2012, 13 https://doi.org/10.1145/2391229.2391242
15	K Ren, Y C Kwon, M Balazinska, B Howe. Hadoop’s adolescence: an analysis of hadoop usage in scientific workloads. Proceedings of the VLDB Endowment, 2013, 6(10): 853–864 https://doi.org/10.14778/2536206.2536213
16	J Shi, J Zou, J Lu, Z Cao, S Li, C Wang. Mrtuner: a toolkit to enable holistic optimization for mapreduce jobs. Proceedings of the VLDB Endowment, 2014, 7(13): 1319–1330 https://doi.org/10.14778/2733004.2733005
17	B A Shirazi, K M Kavi, A R Hurson. Scheduling and Load Balancing in Parallel and Distributed Systems. Los Alamitos: IEEE Computer Society Press, 1995
18	V Bharadwaj, D Ghose, V Mani, T G Robertazzi. Scheduling Divisible Loads in Parallel and Distributed Systems. New York: John Wiley & Sons, 1996
19	S Ibrahim, H Jin, L Lu, S Wu, B He. Leen: locality/fairness-aware key partitioning for mapreduce in the cloud. In: Proceedings of the 2nd IEEE International Conference on Cloud Computing Technology and Science. 2010, 17–24 https://doi.org/10.1109/CloudCom.2010.25
20	S Ibrahim, H Jin, L Lu, B He, G Antoniu. Handling partitioning skew in mapreduce using leen. Peer-to-Peer Networking and Applications, 2013, 6(4): 409–424 https://doi.org/10.1007/s12083-013-0213-7
21	P Dhawalia, S Kailasam, D Janakiram. Chisel: a resource savvy approach for handling skew in mapreduce applications. In: Proceedings of the 6th IEEE International Conference on Cloud Computing. 2013, 652–660 https://doi.org/10.1109/CLOUD.2013.43
22	R Vernica, A Balmin, K S Beyer, V Ercegovac. Adaptive mapreduce using situation-aware mappers. In: Proceedings of the 15th International Conference on Extending Database Technology. 2012, 420–431 https://doi.org/10.1145/2247596.2247646
23	S R Ramakrishnan, G Swart, A Urmanov. Balancing reducer skew in mapreduce workloads using progressive sampling. In: Proceedings of the 3rd ACM Symposium on Cloud Computing. 2012, 16 https://doi.org/10.1145/2391229.2391245
24	R Grover, M J Carey. Extending map-reduce for efficient predicatebased sampling. In: Proceedings of the 28th IEEE International Conference on Data Engineering. 2012, 486–497
25	Y C Kwon, M Balazinska, B Howe, J Rolia. Skewtune: mitigating skew in mapreduce applications. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. 2012, 25–36 https://doi.org/10.1145/2213836.2213840
26	P Dhawalia, S Kailasam, D Janakiram. Chisel++: handling partitioning skew in mapreduce framework using efficient range partitioning technique. In: Proceedings of the 6th International Workshop on Data Intensive Distributed Computing. 2014, 21–28 https://doi.org/10.1145/2608020.2608021
27	A Metwally, C Faloutsos. V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. Proceedings of the VLDB Endowment, 2012, 5(8):704–715 https://doi.org/10.14778/2212351.2212353
28	M A H Hassan, M Bamha, F Loulergue. Handling data-skew effects in join operations using mapreduce. Procedia Computer Science, 2014, 29: 145–158 https://doi.org/10.1016/j.procs.2014.05.014
29	Y C Kwon, M Balazinska, B Howe, J Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In: Proceedings of the 1st ACM Symposium on Cloud Computing. 2010, 75–86 https://doi.org/10.1145/1807128.1807140
30	W G Cochran. Sampling Techniques. New York: John Wiley & Sons, 2007
31	J D Ullman. NP-complete scheduling problems. Journal of Computer and System Sciences, 1975, 10(3): 384–393 https://doi.org/10.1016/S0022-0000(75)80008-0
32	R L Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 1969, 17(2): 416–429 https://doi.org/10.1137/0117039
33	R L Graham. Bounds on the performance of scheduling algorithms. Computer and Job Scheduling Theory, 1976, 165–227

[1]

Article highlights

Download

[1]	Cheqing JIN, Jie CHEN, Huiping LIU. MapReduce-based entity matching with multiple blocking functions[J]. Front. Comput. Sci., 2017, 11(5): 895-911.
[2]	Xite WANG,Derong SHEN,Mei BAI,Tiezheng NIE,Yue KOU,Ge YU. SAMES: deadline-constraint scheduling in MapReduce[J]. Front. Comput. Sci., 2015, 9(1): 128-141.
[3]	Huiju WANG,Furong LI,Xuan ZHOU,Yu CAO,Xiongpai QIN,Jidong CHEN,Shan WANG. HC-Store: putting MapReduce’s foot in two camps[J]. Front. Comput. Sci., 2014, 8(6): 859-871.
[4]	Zhiwei XU , Li ZHA , Yongqiang HE , Wei LIN , . Four styles of parallel and net programming[J]. Front. Comput. Sci., 2009, 3(3): 290-301.

Viewed

Full text

Abstract

Cited

Shared

Discussed