Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal distribution code: 80-970

2019 Impact Factor: 1.275

Frontiers of Computer Science, 2021, Vol. 15, Issue 6: 156107. https://doi.org/10.1007/s11704-020-0190-y
User-level failure detection and auto-recovery of parallel programs in HPC systems
Guozhen ZHANG1,2,3, Yi LIU2,3, Hailong YANG1,2,3, Jun XU4, Depei QIAN2,3
1. State Key Laboratory of Software Development Environment, Beijing 100191, China
2. Sino-German Joint Software Institute, Beihang University, Beijing 100191, China
3. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
4. Science and Technology on Space System Simulation Laboratory Beijing Simulation Center, Beijing 100854, China
Abstract

As the mean time between failures (MTBF) continues to decline with the increasing number of components in large-scale high performance computing (HPC) systems, program failures are likely to occur during execution. Ensuring the successful execution of HPC programs has therefore become an issue that unprivileged users must concern themselves with. From the user's perspective, a program failure that is not detected and handled in time wastes resources and delays the progress of program execution. Unfortunately, unprivileged users are unable to check program state, because execution is controlled by the job management system and their privileges are limited. Currently, automated tools that support user-level failure detection and auto-recovery of parallel programs in HPC systems are missing. This paper proposes an innovative method that allows unprivileged users to detect failures of job execution and automatically resubmit failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with user jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach. We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experimental results show that ARL detects execution failures effectively on the Tianhe-2 system, and that the communication and performance overhead caused by ARL is negligible. The good scalability of ARL makes it applicable to large-scale HPC systems.
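The detect-and-resubmit idea described in the abstract can be illustrated with a small polling loop. This is only a sketch under stated assumptions, not the ARL implementation: the names `JobState`, `monitor_job`, `query_state`, and `resubmit` are hypothetical, the scheduler is abstracted behind two caller-supplied functions rather than any real job management system's commands, and the dual-checker mechanism is not modeled.

```python
import time
from enum import Enum
from typing import Callable, List


class JobState(Enum):
    """Hypothetical, simplified job states as seen by an unprivileged user."""
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


def monitor_job(
    job_id: str,
    query_state: Callable[[str], JobState],
    resubmit: Callable[[str], str],
    max_resubmits: int = 3,
    poll_interval: float = 0.0,
) -> List[str]:
    """Poll a job until it completes, resubmitting it whenever it fails.

    query_state and resubmit are assumptions standing in for whatever
    user-level interface the job management system exposes.
    Returns every job id used, in submission order.
    """
    history = [job_id]
    resubmits = 0
    while True:
        state = query_state(history[-1])
        if state is JobState.COMPLETED:
            return history
        if state is JobState.FAILED:
            if resubmits >= max_resubmits:
                raise RuntimeError(
                    f"job still failing after {resubmits} resubmissions"
                )
            # resubmit returns the id of the newly submitted job
            history.append(resubmit(history[-1]))
            resubmits += 1
        time.sleep(poll_interval)
```

In this sketch the checker runs wherever the caller invokes it; the abstract instead describes encapsulating the state checker as an independent job, which would correspond to submitting a loop like this through the scheduler itself.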

Key words: high performance computing; parallel program; failure detection; failure auto-recovery
Received: 2020-05-10      Published: 2021-09-07
Corresponding Author(s): Hailong YANG   
Cite this article:
Guozhen ZHANG, Yi LIU, Hailong YANG, Jun XU, Depei QIAN. User-level failure detection and auto-recovery of parallel programs in HPC systems. Front. Comput. Sci., 2021, 15(6): 156107.
Link to this article:
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-020-0190-y
https://academic.hep.com.cn/fcs/CN/Y2021/V15/I6/156107