1. State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
3. Smart City College, Beijing Union University, Beijing 100101, China
Wide-area high-performance computing is widely used for large-scale parallel computing applications owing to its abundant computing and storage resources. However, the geographical distribution of these resources makes efficient task distribution and data placement more challenging. To achieve higher system performance, this study proposes a two-level global collaborative scheduling strategy for wide-area high-performance computing environments. The strategy integrates lightweight solution selection, redundant data placement, and task stealing mechanisms, optimizing task distribution and data placement to achieve efficient computing in wide-area environments. The experimental results indicate that, compared with the state-of-the-art collaborative scheduling algorithm HPS+, the proposed scheduling strategy reduces the makespan by 23.24%, improves computing and storage resource utilization by 8.28% and 21.73%, respectively, and achieves similar global data migration costs.
estimated data migration time for task in solution
estimated waiting time of task in center
estimated computing time of task in center
computing start time of task in center
completion time of task in solution
estimated computing quantity for task
accessed proportion of data for task in center
access time interval of data for task in center
priority of task
resource relevance parameter representing the task's requirements and the resource status in center
Tab.1
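The quantities listed in Tab.1 can be combined into a completion-time estimate. The following is a minimal sketch, not the paper's formulation: it assumes that a task starts computing once both its data migration and its queue wait have finished, and that the completion time is this start time plus the estimated computing time. The names `Estimate`, `completion_time`, and `pick_center` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Estimate:
    """Per-center estimates for one task (all values in seconds)."""
    migration_time: float  # estimated data migration time
    waiting_time: float    # estimated waiting time in the center
    computing_time: float  # estimated computing time in the center


def completion_time(e: Estimate) -> float:
    # Assumption: migration and queue waiting overlap, so computing starts
    # when the slower of the two has finished.
    start = max(e.migration_time, e.waiting_time)
    return start + e.computing_time


def pick_center(candidates: Dict[str, Estimate]) -> str:
    # Choose the center with the smallest estimated completion time.
    return min(candidates, key=lambda name: completion_time(candidates[name]))


candidates = {
    "DC1": Estimate(migration_time=10.0, waiting_time=4.0, computing_time=6.0),
    "DC2": Estimate(migration_time=0.0, waiting_time=5.0, computing_time=8.0),
}
best = pick_center(candidates)
```

A different overlap assumption (for example, migration starting only after the queue wait) would change `completion_time` to a sum of all three terms.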
Fig.2
Fig.3
| Data center | Core number | Core computing capacity/GFlops |
|---|---|---|
| DC1 | 3500 | 22.89 |
| DC2 | 3200 | 29.43 |
| DC3 | 2600 | 14.55 |
| DC4 | 2700 | 22.89 |
| DC5 | 2800 | 14.55 |
Tab.2
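To give a sense of how heterogeneous the five data centers in Tab.2 are, the sketch below multiplies each center's core count by its core computing capacity. It assumes that the capacity column is a per-core figure, so the product approximates the peak aggregate capacity of a center; the paper does not state this aggregate itself.

```python
# (core number, core computing capacity in GFlops) per data center, from Tab.2
centers = {
    "DC1": (3500, 22.89),
    "DC2": (3200, 29.43),
    "DC3": (2600, 14.55),
    "DC4": (2700, 22.89),
    "DC5": (2800, 14.55),
}

# Assumption: capacity is per core, so cores * capacity is the peak of a center.
aggregate = {name: cores * gflops for name, (cores, gflops) in centers.items()}
total = sum(aggregate.values())
largest = max(aggregate, key=aggregate.get)

for name, value in aggregate.items():
    print(f"{name}: {value:.0f} GFlops ({100 * value / total:.1f}% of total)")
```

Under this assumption DC2 has the largest aggregate capacity even though DC1 has more cores.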
Fig.4
| Name | Starting Time | Max core requirement |
|---|---|---|
| CTC-SP2 | 1996.03.01 | 308 |
| SDSC-DS | 2004.02.01 | 1360 |
| CEA-Curie | 2011.02.01 | 2560 |
Tab.3
| | HPS+ | CADT | CAMS | DSS | GCSS |
|---|---|---|---|---|---|
| Task scheduling | | | | | |
| Data placement | | | | | |
| Data replica | | | | | |
| Bandwidth allocation | | | | | |
Tab.4
Fig.5
Makespan for different numbers of input tasks (s)

| Trace | Algorithm | 500 | 1000 | 1500 | 2000 | 2500 |
|---|---|---|---|---|---|---|
| CTC | GCSS | 23294.70 | 27072.36 | 27072.36 | 32064.36 | 52278.23 |
| CTC | HPS+ | 23483.83 | 28018.56 | 32680.13 | 45532.16 | 68102.33 |
| CTC | CADT | 23534.31 | 28370.69 | 51680.07 | 74013.17 | 91357.67 |
| CTC | CAMS | 60032.69 | 93366.80 | 106422.44 | 132500.95 | 168278.29 |
| CTC | DSS | 63621.97 | 144647.44 | 229714.13 | 307202.79 | 381714.16 |
| CTC | FIFO | 145525.05 | 270427.54 | 388680.07 | 496644.02 | 594694.38 |
| SDSC | GCSS | 45002.67 | 75000.53 | 104094.65 | 122001.59 | 135672.46 |
| SDSC | HPS+ | 60037.27 | 91149.37 | 112037.86 | 145001.99 | 168628.91 |
| SDSC | CADT | 55040.86 | 99149.06 | 112024.88 | 149003.02 | 187038.23 |
| SDSC | CAMS | 73082.97 | 128001.85 | 166037.69 | 228001.95 | 260335.96 |
| SDSC | DSS | 53000.30 | 134015.11 | 237012.59 | 292500.75 | 329500.84 |
| SDSC | FIFO | 155052.81 | 322680.00 | 490024.14 | 599009.51 | 695003.40 |
| CEA | GCSS | 16379.88 | 32403.52 | 83210.37 | 106435.17 | 189821.53 |
| CEA | HPS+ | 45601.52 | 85010.38 | 220032.02 | 290035.17 | 500110.57 |
| CEA | CADT | 35515.36 | 69402.31 | 129217.18 | 161435.17 | 258602.19 |
| CEA | CAMS | 64382.38 | 118003.64 | 254020.83 | 328030.54 | 522000.20 |
| CEA | DSS | 74500.41 | 213500.23 | 321000.23 | 382503.66 | 550500.83 |
| CEA | FIFO | 137594.62 | 329010.83 | 490018.56 | 603030.52 | 842013.76 |
Tab.5
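The per-cell makespan reduction of GCSS relative to HPS+ can be derived from the values in Tab.5. The sketch below uses the plain relative difference (HPS+ minus GCSS, divided by HPS+) and an unweighted mean over all cells. Both choices are assumptions: the paper does not state here how its aggregate reduction figure is computed, so this mean need not reproduce it.

```python
# Makespan in seconds for 500, 1000, 1500, 2000, 2500 input tasks (Tab.5)
makespan = {
    "CTC": {
        "GCSS": [23294.70, 27072.36, 27072.36, 32064.36, 52278.23],
        "HPS+": [23483.83, 28018.56, 32680.13, 45532.16, 68102.33],
    },
    "SDSC": {
        "GCSS": [45002.67, 75000.53, 104094.65, 122001.59, 135672.46],
        "HPS+": [60037.27, 91149.37, 112037.86, 145001.99, 168628.91],
    },
    "CEA": {
        "GCSS": [16379.88, 32403.52, 83210.37, 106435.17, 189821.53],
        "HPS+": [45601.52, 85010.38, 220032.02, 290035.17, 500110.57],
    },
}


def reduction_percent(baseline: float, proposed: float) -> float:
    """Relative makespan reduction of `proposed` against `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


reductions = {
    trace: [reduction_percent(b, p) for b, p in zip(rows["HPS+"], rows["GCSS"])]
    for trace, rows in makespan.items()
}

# Unweighted mean over every (trace, task count) cell; an assumed aggregation.
cells = [r for values in reductions.values() for r in values]
mean_reduction = sum(cells) / len(cells)

for trace, values in reductions.items():
    print(trace, [f"{v:.1f}%" for v in values])
```

The per-trace lists make it easy to see that the gap between the two algorithms varies considerably across traces and task counts.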
Fig.6
Fig.7
Global data migration cost (s)

| Algorithm | CTC | SDSC | CEA |
|---|---|---|---|
| GCSS | 1966350 | 1855279 | 551576 |
| HPS+ | 641278 | 4566047 | 403164 |
| CADT | 2988213 | 3744135 | 2424097 |
| CAMS | 4141334 | 5900437 | 3544304 |
| DSS | 17510440 | 7043739 | 5372566 |
| FIFO | 22533680 | 24543810 | 14834070 |
Tab.6
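The migration costs in Tab.6 can be compared per trace and in total. The sketch below computes the GCSS-to-HPS+ cost ratio for each trace and the sum over the three traces for every algorithm. Summing across traces is an assumed way of aggregating, not one the paper describes.

```python
# Global data migration cost in seconds per trace (Tab.6)
traces = ["CTC", "SDSC", "CEA"]
cost = {
    "GCSS": [1966350, 1855279, 551576],
    "HPS+": [641278, 4566047, 403164],
    "CADT": [2988213, 3744135, 2424097],
    "CAMS": [4141334, 5900437, 3544304],
    "DSS": [17510440, 7043739, 5372566],
    "FIFO": [22533680, 24543810, 14834070],
}

# Ratio above 1 means GCSS migrates more than HPS+ on that trace.
ratio = {t: g / h for t, g, h in zip(traces, cost["GCSS"], cost["HPS+"])}

# Assumed aggregation: plain sum over the three traces.
totals = {name: sum(values) for name, values in cost.items()}

for t in traces:
    print(f"{t}: GCSS/HPS+ = {ratio[t]:.2f}")
for name, value in totals.items():
    print(f"{name}: total {value}")
```

The per-trace ratios differ in direction: GCSS costs more than HPS+ on CTC and CEA and less on SDSC.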
Fig.8
Fig.9