Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal distribution code 80-970

2019 Impact Factor: 1.275

Frontiers of Computer Science  2022, Vol. 16 Issue (5): 165105   https://doi.org/10.1007/s11704-021-0353-5
GCSS: a global collaborative scheduling strategy for wide-area high-performance computing
Yao SONG1,2, Limin XIAO1,2, Liang WANG2, Guangjun QIN3, Bing WEI1,2, Baicheng YAN1,2, Chenhao ZHANG1,2
1. State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
3. Smart City College, Beijing Union University, Beijing 100101, China
Abstract

Wide-area high-performance computing is widely used for large-scale parallel computing applications owing to its abundant computing and storage resources. However, the geographical distribution of these resources makes efficient task distribution and data placement more challenging. To achieve higher system performance, this study proposes a two-level global collaborative scheduling strategy (GCSS) for wide-area high-performance computing environments. The strategy integrates lightweight solution selection, redundant data placement, and task stealing mechanisms, optimizing task distribution and data placement to achieve efficient computing in wide-area environments. Experimental results indicate that, compared with the state-of-the-art collaborative scheduling algorithm HPS+, the proposed strategy reduces the makespan by 23.24%, improves computing and storage resource utilization by 8.28% and 21.73%, respectively, and achieves similar global data migration costs.

Key words: high-performance computing, scheduling strategy, task scheduling, data placement
Received: 2020-07-16      Published: 2021-12-31
Corresponding Author(s): Limin XIAO, Guangjun QIN
Cite this article:
Yao SONG, Limin XIAO, Liang WANG, Guangjun QIN, Bing WEI, Baicheng YAN, Chenhao ZHANG. GCSS: a global collaborative scheduling strategy for wide-area high-performance computing. Front. Comput. Sci., 2022, 16(5): 165105.
Link to this article:
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-021-0353-5
https://academic.hep.com.cn/fcs/CN/Y2022/V16/I5/165105
Fig.1  
| Parameter | Description |
| --- | --- |
| $A_k$ | application requirements of task $k$ |
| $B_{i,j,k}$ | data migration bandwidth for task $k$ in solution $(i,j)$ |
| $C_j$ | total cores in center $j$ |
| $C^{avail}_j$ | number of available cores in center $j$ |
| $C_k$ | cores required by task $k$ |
| $CC_j$ | average computing capacity of each core in center $j$ |
| $F_k$ | task stealing value of task $k$ |
| $I_k$ | source data center set for task $k$ |
| $J_k$ | target data center set for task $k$ |
| $P_{j,k}$ | data placement value of data for task $k$ in center $j$ |
| $S_j$ | total storage resources in center $j$ |
| $S^{remain}_j$ | idle storage resources in center $j$ |
| $S_k$ | storage resource requirements of task $k$ |
| $T^{trans}_{i,j,k}$ | estimated data migration time for task $k$ in solution $(i,j)$ |
| $T^{wait}_{j,k}$ | estimated waiting time of task $k$ in center $j$ |
| $T^{comp}_{j,k}$ | estimated computing time of task $k$ in center $j$ |
| $T^{start}_{j,k}$ | computing start time of task $k$ in center $j$ |
| $T_{i,j,k}$ | completion time of task $k$ in solution $(i,j)$ |
| $W^{comp}_k$ | estimated computing quantity for task $k$ |
| $f_{j,k}$ | accessed proportion of data for task $k$ in center $j$ |
| $t_{j,k}$ | access time interval of data for task $k$ in center $j$ |
| $priority_k$ | priority of task $k$ |
| $?_{j,k}$ | resource relevance parameter representing task $k$'s requirements and resource status in center $j$ |

Tab.1
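The parameters above suggest how a completion-time estimate for a candidate solution (i,j) could be assembled. The following is a minimal sketch under stated assumptions, not the paper's formulas: it assumes the migration time is S_k / B_{i,j,k}, the computing time is W^comp_k / (C_k · CC_j), computing starts once both the data and the cores are ready, and the selected solution is simply the one with the smallest estimate. The `Center` class, the function names, and the wait, bandwidth, and workload values are hypothetical; only the two per-core capacities are taken from the data center table on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Center:
    """A target data center j (hypothetical container, not from the paper)."""
    name: str
    core_gflops: float  # CC_j: average computing capacity per core
    wait_s: float       # T^wait_{j,k}: estimated queue wait, assumed given


def completion_time(work_gflop, cores, data_gb, bandwidth_gb_s, center):
    """Estimate T_{i,j,k} under the assumed model described in the lead-in."""
    # Assumption: T^trans = S_k / B_{i,j,k}; zero when the data is already local.
    t_trans = 0.0 if bandwidth_gb_s is None else data_gb / bandwidth_gb_s
    # Assumption: T^comp = W^comp_k / (C_k * CC_j).
    t_comp = work_gflop / (cores * center.core_gflops)
    # Assumption: computing starts once both the data and the cores are ready.
    t_start = max(t_trans, center.wait_s)
    return t_start + t_comp


def select_solution(work_gflop, cores, data_gb, sources, targets, bandwidth):
    """Return the (source, target, time) triple with the smallest estimate."""
    candidates = []
    for i in sources:
        for j in targets:
            # No migration is needed when source and target coincide.
            bw = None if i == j.name else bandwidth[(i, j.name)]
            t = completion_time(work_gflop, cores, data_gb, bw, j)
            candidates.append((i, j.name, t))
    return min(candidates, key=lambda c: c[2])


if __name__ == "__main__":
    dc1 = Center("DC1", core_gflops=22.89, wait_s=600.0)  # busy center
    dc2 = Center("DC2", core_gflops=29.43, wait_s=0.0)    # idle center
    bandwidth = {("DC1", "DC2"): 1.0}  # GB/s, made-up value
    print(select_solution(1_000_000.0, 100, 200.0, ["DC1"], [dc1, dc2], bandwidth))
```

In this made-up scenario, migrating the data to the idle center wins over waiting in the local queue, which is the kind of trade-off a joint task and data scheduler has to weigh.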
Fig.2  
Fig.3  
| Data center | Core number | Core computing capacity (GFlops) |
| --- | --- | --- |
| DC1 | 3500 | 22.89 |
| DC2 | 3200 | 29.43 |
| DC3 | 2600 | 14.55 |
| DC4 | 2700 | 22.89 |
| DC5 | 2800 | 14.55 |

Tab.2
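As a rough illustration of how heterogeneous these centers are, the nominal aggregate capacity of each one can be estimated as core number times per-core capacity. This is a sketch that assumes capacity scales linearly with the number of cores; the function and variable names are hypothetical, and the figures are the ones listed in the table above.

```python
# (core number, per-core capacity in GFlops), copied from the table above.
centers = {
    "DC1": (3500, 22.89),
    "DC2": (3200, 29.43),
    "DC3": (2600, 14.55),
    "DC4": (2700, 22.89),
    "DC5": (2800, 14.55),
}


def nominal_gflops(cores, gflops_per_core):
    """Nominal aggregate capacity, assuming linear scaling across cores."""
    return cores * gflops_per_core


capacity = {name: nominal_gflops(*spec) for name, spec in centers.items()}
total = sum(capacity.values())

for name, value in capacity.items():
    print(f"{name}: {value:,.0f} GFlops ({value / total:.1%} of total)")
```

Under this assumption DC2 is the largest center despite not having the most cores, so core count alone is a poor proxy for capacity.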
Fig.4  
| Name | Starting time | Max core requirement |
| --- | --- | --- |
| CTC-SP2 | 1996.03.01 | 308 |
| SDSC-DS | 2004.02.01 | 1360 |
| CEA-Curie | 2011.02.01 | 2560 |

Tab.3
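| Feature | HPS+ | CADT | CAMS | DSS | GCSS |
| --- | --- | --- | --- | --- | --- |
| Task scheduling | ✓ | ✓ | ✓ | ✓ | ✓ |
| Data placement | ✓ | ✓ | ✓ | × | ✓ |
| Data replica | × | × | ✓ | × | ✓ |
| Bandwidth allocation | ✓ | × | × | × | × |

Tab.4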
Fig.5  
Makespan (s) for different numbers of input tasks:

| Trace | Algorithm | 500 | 1000 | 1500 | 2000 | 2500 |
| --- | --- | --- | --- | --- | --- | --- |
| CTC | GCSS | 23294.70 | 27072.36 | 27072.36 | 32064.36 | 52278.23 |
| CTC | HPS+ | 23483.83 | 28018.56 | 32680.13 | 45532.16 | 68102.33 |
| CTC | CADT | 23534.31 | 28370.69 | 51680.07 | 74013.17 | 91357.67 |
| CTC | CAMS | 60032.69 | 93366.80 | 106422.44 | 132500.95 | 168278.29 |
| CTC | DSS | 63621.97 | 144647.44 | 229714.13 | 307202.79 | 381714.16 |
| CTC | FIFO | 145525.05 | 270427.54 | 388680.07 | 496644.02 | 594694.38 |
| SDSC | GCSS | 45002.67 | 75000.53 | 104094.65 | 122001.59 | 135672.46 |
| SDSC | HPS+ | 60037.27 | 91149.37 | 112037.86 | 145001.99 | 168628.91 |
| SDSC | CADT | 55040.86 | 99149.06 | 112024.88 | 149003.02 | 187038.23 |
| SDSC | CAMS | 73082.97 | 128001.85 | 166037.69 | 228001.95 | 260335.96 |
| SDSC | DSS | 53000.30 | 134015.11 | 237012.59 | 292500.75 | 329500.84 |
| SDSC | FIFO | 155052.81 | 322680.00 | 490024.14 | 599009.51 | 695003.40 |
| CEA | GCSS | 16379.88 | 32403.52 | 83210.37 | 106435.17 | 189821.53 |
| CEA | HPS+ | 45601.52 | 85010.38 | 220032.02 | 290035.17 | 500110.57 |
| CEA | CADT | 35515.36 | 69402.31 | 129217.18 | 161435.17 | 258602.19 |
| CEA | CAMS | 64382.38 | 118003.64 | 254020.83 | 328030.54 | 522000.20 |
| CEA | DSS | 74500.41 | 213500.23 | 321000.23 | 382503.66 | 550500.83 |
| CEA | FIFO | 137594.62 | 329010.83 | 490018.56 | 603030.52 | 842013.76 |

Tab.5
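The relative makespan reduction of GCSS over HPS+ can be recomputed from these rows as (HPS+ − GCSS) / HPS+. The sketch below does this for the CTC trace, with hypothetical function and variable names. For the 2500-task case the result is about 23.24%, which coincides with the makespan reduction quoted in the abstract; that the abstract refers to exactly this case is an inference from the numbers, not something this page states.

```python
def reduction(baseline, proposed):
    """Relative makespan reduction of `proposed` with respect to `baseline`."""
    return (baseline - proposed) / baseline


# CTC rows copied from the makespan table above (seconds).
task_counts = [500, 1000, 1500, 2000, 2500]
gcss = [23294.70, 27072.36, 27072.36, 32064.36, 52278.23]
hps_plus = [23483.83, 28018.56, 32680.13, 45532.16, 68102.33]

reductions = {
    n: reduction(h, g) for n, g, h in zip(task_counts, gcss, hps_plus)
}

for n, r in reductions.items():
    print(f"{n} tasks: {r:.2%}")
```

The gap is under 1% at 500 tasks and widens as the load grows, which suggests the benefit shows up mainly once the centers are contended.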
Fig.6  
Fig.7  
Global data migration cost (s):

| Algorithm | CTC | SDSC | CEA |
| --- | --- | --- | --- |
| GCSS | 1966350 | 1855279 | 551576 |
| HPS+ | 641278 | 4566047 | 403164 |
| CADT | 2988213 | 3744135 | 2424097 |
| CAMS | 4141334 | 5900437 | 3544304 |
| DSS | 17510440 | 7043739 | 5372566 |
| FIFO | 22533680 | 24543810 | 14834070 |

Tab.6
Fig.8  
Fig.9  