GCSS: a global collaborative scheduling strategy for wide-area high-performance computing

doi:10.1007/s11704-021-0353-5

Frontiers of Computer Science

2022, Vol. 16, Issue (5): 165105 https://doi.org/10.1007/s11704-021-0353-5

Yao SONG¹,², Limin XIAO¹,², Liang WANG², Guangjun QIN³, Bing WEI¹,², Baicheng YAN¹,², Chenhao ZHANG¹,²

¹. State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
². School of Computer Science and Engineering, Beihang University, Beijing 100191, China
³. Smart City College, Beijing Union University, Beijing 100101, China

Abstract:

Wide-area high-performance computing is widely used for large-scale parallel computing applications owing to its abundant computing and storage resources. However, the geographical distribution of these resources makes efficient task distribution and data placement more challenging. To achieve higher system performance, this study proposes a two-level global collaborative scheduling strategy (GCSS) for wide-area high-performance computing environments. The strategy integrates lightweight solution selection, redundant data placement, and task stealing mechanisms, optimizing task distribution and data placement to achieve efficient computing in wide-area environments. Experimental results indicate that, compared with the state-of-the-art collaborative scheduling algorithm HPS+, the proposed strategy reduces the makespan by 23.24%, improves computing and storage resource utilization by 8.28% and 21.73%, respectively, and achieves similar global data migration costs.

Key words: high-performance computing, scheduling strategy, task scheduling, data placement

Received: 2020-07-16    Published: 2021-12-31

Corresponding Author(s): Limin XIAO, Guangjun QIN

Cite this article:

Yao SONG, Limin XIAO, Liang WANG, Guangjun QIN, Bing WEI, Baicheng YAN, Chenhao ZHANG. GCSS: a global collaborative scheduling strategy for wide-area high-performance computing. Front. Comput. Sci., 2022, 16(5): 165105.

Link to this article:

https://academic.hep.com.cn/fcs/CN/10.1007/s11704-021-0353-5
https://academic.hep.com.cn/fcs/CN/Y2022/V16/I5/165105

Fig.1

Parameter	Description
$A_k$	application requirements of task $k$
$B_{i,j,k}$	data migration bandwidth for task $k$ in solution $(i, j)$
$C_j$	total cores in center $j$
$C^{\mathrm{avail}}_j$	number of available cores in center $j$
$C_k$	cores required by task $k$
$CC_j$	average computing capacity of each core in center $j$
$F_k$	task stealing value of task $k$
$I_k$	source data center set for task $k$
$J_k$	target data center set for task $k$
$P_{j,k}$	data placement value of data for task $k$ in center $j$
$S_j$	total storage resources in center $j$
$S^{\mathrm{remain}}_j$	idle storage resources in center $j$
$S_k$	storage resource requirements of task $k$
$T^{\mathrm{trans}}_{i,j,k}$	estimated data migration time for task $k$ in solution $(i, j)$
$T^{\mathrm{wait}}_{j,k}$	estimated waiting time of task $k$ in center $j$
$T^{\mathrm{comp}}_{j,k}$	estimated computing time of task $k$ in center $j$
$T^{\mathrm{start}}_{j,k}$	computing start time of task $k$ in center $j$
$T_{i,j,k}$	completion time of task $k$ in solution $(i, j)$
$W^{\mathrm{comp}}_k$	estimated computing quantity for task $k$
$f_{j,k}$	accessed proportion of data for task $k$ in center $j$
$t_{j,k}$	access time interval of data for task $k$ in center $j$
$\mathit{priority}_k$	priority of task $k$
$?_{j,k}$	resource relevance parameter representing task $k$'s requirements and resource status in center $j$

Tab.1
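
To make the notation in Tab.1 concrete, the following minimal Python sketch shows one way the timing symbols could combine when choosing a solution $(i, j)$ for a task $k$. It is an illustration under stated assumptions, not the formulation used in the paper: the page gives only the symbol definitions, so the relations $T^{\mathrm{trans}}_{i,j,k} = S_k / B_{i,j,k}$, $T^{\mathrm{comp}}_{j,k} = W^{\mathrm{comp}}_k / (C_k \cdot CC_j)$, $T^{\mathrm{start}}_{j,k} = \max(T^{\mathrm{trans}}_{i,j,k}, T^{\mathrm{wait}}_{j,k})$ and $T_{i,j,k} = T^{\mathrm{start}}_{j,k} + T^{\mathrm{comp}}_{j,k}$ are assumed here, and the names `Task`, `Center`, `completion_time` and `best_solution` are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Task:
    cores: int        # C_k: cores required by the task
    data_size: float  # S_k: storage requirement, used here as the volume to migrate (assumption)
    work: float       # W^comp_k: estimated computing quantity


@dataclass(frozen=True)
class Center:
    core_capacity: float  # CC_j: average computing capacity of each core
    wait_time: float      # T^wait_{j,k}: estimated waiting time of the task in this center


def completion_time(task: Task, target: Center, bandwidth: Optional[float]) -> float:
    """Assumed model of T_{i,j,k}; bandwidth=None means the data is already local."""
    t_trans = 0.0 if bandwidth is None else task.data_size / bandwidth
    t_comp = task.work / (task.cores * target.core_capacity)
    # Assumption: computing starts once the data has arrived and the queue wait is over.
    t_start = max(t_trans, target.wait_time)
    return t_start + t_comp


def best_solution(task, sources, targets, bandwidth):
    """Return ((i, j), T_{i,j,k}) with the smallest assumed completion time.

    sources: ids of centers holding the data (I_k)
    targets: dict mapping center id -> Center (J_k)
    bandwidth: dict mapping (i, j) -> B_{i,j,k} for every pair with i != j
    """
    candidates = {
        (i, j): completion_time(task, targets[j], None if i == j else bandwidth[(i, j)])
        for i, j in product(sources, targets)
    }
    return min(candidates.items(), key=lambda item: item[1])


# Hypothetical example values, chosen only to exercise the sketch.
task = Task(cores=4, data_size=100.0, work=800.0)
centers = {
    "A": Center(core_capacity=10.0, wait_time=50.0),  # holds the data but is busy
    "B": Center(core_capacity=20.0, wait_time=0.0),   # idle, needs a data migration
}
best = best_solution(task, sources=["A"], targets=centers, bandwidth={("A", "B"): 10.0})
print(best)
```

With these example values, running on the busy local center A would take 70 time units under the assumed model, while migrating the data to the idle center B takes 20, so the sketch selects the solution (A, B). This exhaustive comparison only illustrates how data migration time and waiting time trade off; it does not reproduce the lightweight solution selection, redundant data placement, or task stealing mechanisms of GCSS.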

Fig.2

Fig.3

Tab.2

Fig.4

Tab.3

	HPS+	CADT	CAMS	DSS	GCSS
Task scheduling	✓	✓	✓	✓	✓
Data placement	✓	✓	✓	×	✓
Data replica	×	×	✓	×	✓
Bandwidth allocation	✓	×	×	×	×

Tab.4

Fig.5

Tab.5

Fig.6

Fig.7

Tab.6

Fig.8

Fig.9
