Software approaches for resilience of high performance computing systems: a survey

doi:10.1007/s11704-022-2096-3

Front. Comput. Sci.

2023, Vol. 17, Issue 4: 174105. https://doi.org/10.1007/s11704-022-2096-3

REVIEW ARTICLE


Jie JIA¹,², Yi LIU¹,², Guozhen ZHANG¹,², Yulin GAO¹,², Depei QIAN¹,²

¹. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
². Sino-German Joint Software Institute, Beihang University, Beijing 100191, China


Abstract

As high-performance computing (HPC) systems have scaled up in recent years, their reliability has declined steadily. System resilience is therefore regarded as one of the critical challenges for large-scale HPC systems, and various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. We first present a classification of software resilience approaches and then introduce the major approaches and techniques, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. We also discuss the challenges posed by system scale and heterogeneous architectures.

Keywords: resilience, high-performance computing, fault tolerance, challenge

Corresponding Author(s): Jie JIA

Just Accepted Date: 07 September 2022; Issue Date: 12 December 2022

Cite this article:

Jie JIA, Yi LIU, Guozhen ZHANG, et al. Software approaches for resilience of high performance computing systems: a survey[J]. Front. Comput. Sci., 2023, 17(4): 174105.

URL:

https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-2096-3
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I4/174105

Tab.1 The MTBF of different HPC systems [7]

Tab.2 The terminology of HPC malfunctions

Tab.3 Abnormal states of HPC systems

Tab.4 Classification of typical resilience approaches

Tab.5 Comparison of different checkpointing levels

Tab.6 Software solutions for SDC challenge

1	Dongarra J. Report on the fujitsu fugaku system. University of Tennessee-Knoxville Innovative Computing Laboratory, Tech. Rep. ICLUT-20-06, 2020
2	Di Martino C, Kramer W, Kalbarczyk Z, Iyer R. Measuring and understanding extreme-scale application resilience: a field study of 5,000,000 HPC application runs. In: Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2015, 25–36
3	Hursey J, Squyres J M, Mattox T I, Lumsdaine A. The design and implementation of checkpoint/restart process fault tolerance for open MPI. In: Proceedings of 2007 IEEE International Parallel and Distributed Processing Symposium. 2007, 1–8
4	Cappello F, Geist A, Gropp B, Kale L, Kramer B, Snir M. Toward exascale resilience. The International Journal of High Performance Computing Applications, 2009, 23(4): 374–388
5	Egwutuoha I P, Levy D, Selic B, Chen S. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. The Journal of Supercomputing, 2013, 65(3): 1302–1326
6	Bergman K, Borkar S, Campbell D, Carlson W, Dally W, Denneau M, Franzon P, Harrod W, Hill K, Hiller J, et al. Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep, 2008, 15: 181
7	Gupta S, Patel T, Engelmann C, Tiwari D. Failures in large scale systems: Long-term measurement, analysis, and implications. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2017, 44
8	Radojkovic P, Marazakis M, Carpenter P, Jeyapaul R, Gizopoulos D, Schulz M, Armejach A, Ayguade E A, Bodin F, Canal R, et al. Towards resilient EU HPC systems: A blueprint. PhD thesis, European HPC resilience initiative, 2020
9	Avizienis A, Laprie J C, Randell B, Landwehr C. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 2004, 1(1): 11–33
10	Mukherjee S. Architecture Design for Soft Errors. San Francisco: Morgan Kaufmann, 2008
11	Tan L, DeBardeleben N. Failure analysis and quantification for contemporary and future supercomputers. 2019, arXiv preprint arXiv:1911.02118
12	Shoji F, Matsui S, Okamoto M, Sueyasu F, Tsukamoto T, Uno A, Yamamoto K. Long term failure analysis of 10 peta-scale supercomputer. In: Proceedings of HPC in Asia Session at ISC 2015. 2015
13	Das A, Mueller F, Siegel C, Vishnu A. Desh: deep learning for system health prediction of lead times to failure in HPC. In: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. 2018, 40–51
14	Di Martino C, Kalbarczyk Z, Iyer R K, Baccanico F, Fullop J, Kramer W. Lessons learned from the analysis of system failures at petascale: the case of blue waters. In: Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2014, 610–621
15	El-Sayed N, Schroeder B. Reading between the lines of failure logs: understanding how HPC systems fail. In: Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2013, 1–12
16	Bode B, Butler M, Dunning T, Hoefler T, Kramer W, Gropp W, Hwu W M. The blue waters super-system for super-science. In: Contemporary High Performance Computing: From Petascale toward Exascale, 339–366. Chapman and Hall/CRC, 2013
17	Bland B. Titan - Early experience with the titan system at oak ridge national laboratory. In: Proceedings of 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. 2012, 2189–2211
18	Bautista-Gomez L, Gainaru A, Perarnau S, Tiwari D, Gupta S, Engelmann C, Cappello F, Snir M. Reducing waste in extreme scale systems through introspective analysis. In: Proceedings of 2016 IEEE International Parallel and Distributed Processing Symposium. 2016, 212–221
19	Tiwari D, Gupta S, Vazhkudai S S. Lazy checkpointing: exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems. In: Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2014, 25–36
20	Tiwari D, Gupta S, Gallarno G, Rogers J, Maxwell D. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2015, 1–12
21	Hargrove P H, Duell J C. Berkeley lab checkpoint/restart (BLCR) for Linux clusters. Journal of Physics: Conference Series, 2006, 46: 494–499
22	Ansel J, Arya K, Cooperman G. DMTCP: Transparent checkpointing for cluster computations and the desktop. In: Proceedings of 2009 IEEE International Symposium on Parallel & Distributed Processing. 2009, 1–12
23	Bautista-Gomez L, Tsuboi S, Komatitsch D, Cappello F, Maruyama N, Matsuoka S. FTI: High performance fault tolerance interface for hybrid systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. 2011, 1–12
24	Zhong H, Nieh J. Crak: Linux checkpoint/restart as a kernel module. Technical Report, Citeseer, 2001
25	Osman S, Subhraveti D, Su G, Nieh J. The design and implementation of zap: a system for migrating computing environments. ACM SIGOPS Operating Systems Review, 2002, 36(S1): 361–376
26	Sankaran S, Squyres J M, Barrett B, Sahay V, Lumsdaine A, Duell J, Hargrove P, Roman E. The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. The International Journal of High Performance Computing Applications, 2005, 19(4): 479–493
27	Wang C, Mueller F, Engelmann C, Scott S L. Hybrid checkpointing for MPI jobs in HPC environments. In: Proceedings of the 16th International Conference on Parallel and Distributed Systems. 2010, 524–533
28	Sancho J C, Petrini F, Johnson G, Frachtenberg E. On the feasibility of incremental checkpointing for scientific computing. In: Proceedings of the 18th International Parallel and Distributed Processing Symposium. 2004, 58
29	Agarwal S, Garg R, Gupta M S, Moreira J E. Adaptive incremental checkpointing for massively parallel systems. In: Proceedings of the 18th Annual International Conference on Supercomputing. 2004, 277–286
30	Bosilca G, Bouteiller A, Cappello F, Djilali S, Fedak G, Germain C, Herault T, Lemarinier P, Lodygensky O, Magniette F, Neri V, Selikhov A. MPICH-V: toward a scalable fault tolerant MPI for volatile nodes. In: Proceedings of 2002 ACM/IEEE Conference on Supercomputing. 2002, 29
31	Bronevetsky G, Marques D, Pingali K, Stodghill P. Automated application-level checkpointing of MPI programs. In: Proceedings of the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2003, 84–94
32	Graham R L, Choi S E, Daniel D J, Desai N N, Minnich R G, Rasmussen C E, Risinger L D, Sukalski M W. A network-failure-tolerant message-passing system for terascale clusters. International Journal of Parallel Programming, 2003, 31(4): 285–303
33	Woo N, Choi S, Jung H, Moon J, Yeom H Y, Park T, Park H. MPICH-GF: providing fault tolerance on grid environments. In: Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid2003), the Poster and Research Demo Session. 2003
34	Zheng G, Shi L, Kale L V. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In: Proceedings of 2004 IEEE International Conference on Cluster Computing. 2004, 93–103
35	Zhang Y, Wong D, Zheng W. User-level checkpoint and recovery for LAM/MPI. ACM SIGOPS Operating Systems Review, 2005, 39(3): 72–81
36	Buntinas D, Coti C, Herault T, Lemarinier P, Pilard L, Rezmerita A, Rodriguez E, Cappello F. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI Protocols. Future Generation Computer Systems, 2008, 24(1): 73–84
37	Ruscio J F, Heffner M A, Varadarajan S. DejaVu: transparent user-level checkpointing, migration, and recovery for distributed systems. In: Proceedings of 2007 IEEE International Parallel and Distributed Processing Symposium. 2007, 1–10
38	Cao J, Arya K, Garg R, Matott S, Panda D K, Subramoni H, Vienne J, Cooperman G. System-level scalable checkpoint-restart for petascale computing. In: Proceedings of the 22nd International Conference on Parallel and Distributed Systems. 2016, 932–941
39	Garg R, Price G, Cooperman G. MANA for MPI: MPI-agnostic network-agnostic transparent checkpointing. In: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. 2019, 49–60
40	Laguna I, Richards D F, Gamblin T, Schulz M, De Supinski B R, Mohror K, Pritchard H. Evaluating and extending user-level fault tolerance in MPI applications. The International Journal of High Performance Computing Applications, 2016, 30(3): 305–319
41	Chakraborty S, Laguna I, Emani M, Mohror K, Panda D K, Schulz M, Subramoni H. EREINIT: scalable and efficient fault-tolerance for bulk-synchronous MPI applications. Concurrency and Computation: Practice and Experience, 2020, 32(3): e4863
42	Georgakoudis G, Guo L, Laguna I. Reinit++: evaluating the performance of global-restart recovery methods for MPI fault tolerance. In: Proceedings of the 35th International Conference on High Performance Computing. 2020, 536–554
43	Bronevetsky G, Marques D J, Pingali K K, Rugina R, McKee S A. Compiler-enhanced incremental checkpointing for OpenMP applications. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2008, 275–276
44	Arora R, Bangalore P, Mernik M. A technique for non-invasive application-level checkpointing. The Journal of Supercomputing, 2011, 57(3): 227–255
45	Ba T N, Arora R. A tool for semi-automatic application-level checkpointing. In: Technical Posters at the International Conference for High Performance Computing, Networking, Storage and Analysis. 2016, 16–20
46	Quinlan D, Liao C. The ROSE source-to-source compiler infrastructure. In: Proceedings of the Cetus Users and Compiler Infrastructure Workshop. 2011, 1–3
47	Shahzad F, Thies J, Kreutzer M, Zeiser T, Hager G, Wellein G. CRAFT: a library for easier application-level checkpoint/restart and automatic fault tolerance. IEEE Transactions on Parallel and Distributed Systems, 2019, 30(3): 501–514
48	Takizawa H, Sato K, Komatsu K, Kobayashi H. CheCUDA: a checkpoint/restart tool for CUDA applications. In: Proceedings of 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies. 2009, 408–413
49	Garg R. Extending the domain of transparent checkpoint-restart for large-scale HPC. Northeastern University, Dissertation, 2019
50	Garg R, Mohan A, Sullivan M, Cooperman G. CRUM: checkpoint-restart support for CUDA’s unified memory. In: Proceedings of 2018 IEEE International Conference on Cluster Computing. 2018, 302–313
51	Jain T, Cooperman G. CRAC: Checkpoint-restart architecture for CUDA with streams and UVM. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis. 2020, 1–15
52	Lee K, Sullivan M B, Hari S K S, Tsai T, Keckler S W, Erez M. GPU snapshot: checkpoint offloading for GPU-dense systems. In: Proceedings of the ACM International Conference on Supercomputing. 2019, 171–183
53	Kannan S, Farooqui N, Gavrilovska A, Schwan K. HeteroCheckpoint: efficient checkpointing for accelerator-based systems. In: Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2014, 738–743
54	Vaidya N H. A case for two-level distributed recovery schemes. In: Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems. 1995, 64–73
55	Haines J, Lakamraju V, Koren I, Krishna C M. Application-level fault tolerance as a complement to system-level fault tolerance. The Journal of Supercomputing, 2000, 16(1–2): 53–68
56	Di S, Robert Y, Vivien F, Cappello F. Toward an optimal online checkpoint solution under a two-level HPC checkpoint model. IEEE Transactions on Parallel and Distributed Systems, 2017, 28(1): 244–259
57	Benoit A, Cavelan A, Le Fèvre V, Robert Y, Sun H. Towards optimal multi-level checkpointing. IEEE Transactions on Computers, 2017, 66(7): 1212–1226
58	Ferreira K, Stearley J, Laros J H, Oldfield R, Pedretti K, Brightwell R, Riesen R, Bridges P G, Arnold D. Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. 2011, 1–12
59	Wu P, Ding C, Chen L, Gao F, Davies T, Karlsson C, Chen Z. Fault tolerant matrix-matrix multiplication: Correcting soft errors on-line. In: Proceedings of the 2nd Workshop on Scalable Algorithms for Large-Scale Systems. 2011, 25–28
60	Fiala D, Mueller F, Engelmann C, Riesen R, Ferreira K, Brightwell R. Detection and correction of silent data corruption for large-scale high-performance computing. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 2012, 1–12
61	Wang Z, Yang X, Zhou Y. MMPI: a scalable fault tolerance mechanism for MPI large scale parallel computing. In: Proceedings of the 10th IEEE International Conference on Computer and Information Technology. 2010, 1251–1256
62	Hussain Z, Znati T, Melhem R. Partial redundancy in HPC systems with non-uniform node reliabilities. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis. 2018, 566–576
63	Elliott J, Kharbas K, Fiala D, Mueller F, Ferreira K, Engelmann C. Combining partial redundancy and checkpointing for HPC. In: Proceedings of the 32nd International Conference on Distributed Computing Systems. 2012, 615–626
64	George C, Vadhiyar S. Fault tolerance on large scale systems using adaptive process replication. IEEE Transactions on Computers, 2015, 64(8): 2213–2225
65	Quinn H, Graham P. Terrestrial-based radiation upsets: a cautionary tale. In: Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. 2005, 193–202
66	Schroeder B, Pinheiro E, Weber W D. DRAM errors in the wild: a large-scale field study. Communications of the ACM, 2011, 54(2): 100–107
67	Sedaghat Y, Miremadi S G, Fazeli M. A software-based error detection technique using encoded signatures. In: Proceedings of the 21st IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems. 2006, 389–400
68	Miremadi G, Karlsson J, Gunneflo U, Torin J. Two software techniques for on-line error detection. In: Proceedings of the 22nd International Symposium on Fault-Tolerant Computing. 1992, 328–335
69	Vemu R, Abraham J. CEDA: control-flow error detection using assertions. IEEE Transactions on Computers, 2011, 60(9): 1233–1245
70	Zarandi H R, Maghsoudloo M, Khoshavi N. Two efficient software techniques to detect and correct control-flow errors. In: Proceedings of the 16th Pacific Rim International Symposium on Dependable Computing. 2010, 141–148
71	Gomez L B, Cappello F. Detecting silent data corruption through data dynamic monitoring for scientific applications. ACM SIGPLAN Notices, 2014, 49(8): 381–382
72	Berrocal E, Bautista-Gomez L, Di S, Lan Z, Cappello F. Lightweight silent data corruption detection based on runtime data analysis for HPC applications. In: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing. 2015, 275–278
73	LeBlanc T, Anand R, Gabriel E, Subhlok J. VolpexMPI: an MPI library for execution of parallel applications on volatile nodes. In: Proceedings of the 16th European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting. 2009, 124–133
74	Engelmann C, Boehm S. Redundant execution of HPC applications with MR-MPI. In: Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks. 2011, 31–38
75	Berrocal E, Bautista-Gomez L, Di S, Lan Z, Cappello F. Toward general software level silent data corruption detection for parallel applications. IEEE Transactions on Parallel and Distributed Systems, 2017, 28(12): 3642–3655
76	Fiala D, Ferreira K B, Mueller F, Engelmann C. A tunable, software-based DRAM error detection and correction library for HPC. In: Proceedings of European Conference on Parallel Processing. 2012, 251–261
77	Fiala D, Mueller F, Ferreira K B. FlipSphere: a software-based DRAM error detection and correction library for HPC. In: Proceedings of the 20th International Symposium on Distributed Simulation and Real Time Applications. 2016, 19–28
78	Fiala D, Mueller F, Ferreira K, Engelmann C. Mini-Ckpts: surviving OS failures in persistent memory. In: Proceedings of 2016 International Conference on Supercomputing. 2016, 7
79	Huang K H, Abraham J A. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 1984, C-33(6): 518–528
80	Luk F T, Park H. Fault-tolerant matrix triangularizations on systolic arrays. IEEE Transactions on Computers, 1988, 37(11): 1434–1438
81	Luk F T, Park H. An analysis of algorithm-based fault tolerance techniques. Journal of Parallel and Distributed Computing, 1988, 5(2): 172–184
82	Bouteiller A, Herault T, Bosilca G, Du P, Dongarra J. Algorithm-based fault tolerance for dense matrix factorizations, multiple failures and accuracy. ACM Transactions on Parallel Computing, 2015, 1(2): 10
83	Chen Z. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods. ACM SIGPLAN Notices, 2013, 48(8): 167–176
84	Tao D, Song S L, Krishnamoorthy S, Wu P, Liang X, Zhang E Z, Kerbyson D, Chen Z. New-sum: a novel online ABFT scheme for general iterative methods. In: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. 2016, 43–55
85	Schöll A, Braun C, Kochte M A, Wunderlich H J. Efficient algorithm-based fault tolerance for sparse matrix operations. In: Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2016, 251–262
86	Shantharam M, Srinivasmurthy S, Raghavan P. Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In: Proceedings of the 26th ACM International Conference on Supercomputing. 2012, 69–78
87	Zhu Y, Liu Y, Li M, Qian D. Block-checksum-based fault tolerance for matrix multiplication on large-scale parallel systems. In: Proceedings of the 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems. 2018, 172–179
88	Zhu Y, Liu Y, Zhang G. FT-PBLAS: PBLAS-based fault-tolerant linear algebra computation on high-performance computing systems. IEEE Access, 2020, 8: 42674–42688
89	Chen Z, Dongarra J. Algorithm-based fault tolerance for fail-stop failures. IEEE Transactions on Parallel and Distributed Systems, 2008, 19(12): 1628–1641
90	Roche T, Cunche M, Roch J L. Algorithm-based fault tolerance applied to P2P computing networks. In: Proceedings of the 1st International Conference on Advances in P2P Systems. 2009, 144–149
91	Hakkarinen D, Wu P, Chen Z. Fail-stop failure algorithm-based fault tolerance for Cholesky decomposition. IEEE Transactions on Parallel and Distributed Systems, 2015, 26(5): 1323–1335
92	Davies T, Karlsson C, Liu H, Ding C, Chen Z. High performance linpack benchmark: a fault tolerant implementation without checkpointing. In: Proceedings of the International Conference on Supercomputing. 2011, 162–171
93	Chen J, Li S, Chen Z. GPU-ABFT: optimizing algorithm-based fault tolerance for heterogeneous systems with GPUs. In: Proceedings of 2016 IEEE International Conference on Networking, Architecture and Storage. 2016, 1–2
94	Chen J, Li H, Li S, Liang X, Wu P, Tao D, Ouyang K, Liu Y, Zhao K, Guan Q, Chen Z. Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis. 2018, 854–865
95	Braun C, Halder S, Wunderlich H J. A-ABFT: autonomous algorithm-based fault tolerance for matrix multiplications on graphics processing units. In: Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2014, 443–454
96	Ranganathan S, George A D, Todd R W, Chidester M C. Gossip-style failure detection and distributed consensus for scalable heterogeneous clusters. Cluster Computing, 2001, 4(3): 197–209
97	Gabel M, Schuster A, Bachrach R G, Bjørner N. Latent fault detection in large scale services. In: Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks. 2012, 1–12
98	Wu L, Luo H, Zhan J, Meng D. A runtime fault detection method for HPC cluster. In: Proceedings of the 12th International Conference on Parallel and Distributed Computing, Applications and Technologies. 2011, 68–72
99	Ghiasvand S, Ciorba F M. Anomaly detection in high performance computers: a vicinity perspective. In: Proceedings of the 18th International Symposium on Parallel and Distributed Computing. 2019, 112–120
100	Egwutuoha I P, Chen S, Levy D, Selic B, Calvo R. Cost-oriented proactive fault tolerance approach to high performance computing (HPC) in the cloud. International Journal of Parallel, Emergent and Distributed Systems, 2014, 29(4): 363–378
101	Borghesi A, Libri A, Benini L, Bartolini A. Online anomaly detection in HPC systems. In: Proceedings of 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems. 2019, 229–233
102	Borghesi A, Molan M, Milano M, Bartolini A. Anomaly detection and anticipation in high performance computing systems. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(4): 739–750
103	Dani M C, Doreau H, Alt S. K-means application for anomaly detection and log classification in HPC. In: Proceedings of the 30th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. 2017, 201–210
104	Zhu B, Wang G, Liu X, Hu D, Lin S, Ma J. Proactive drive failure prediction for large scale storage systems. In: Proceedings of the 29th Symposium on Mass Storage Systems and Technologies. 2013, 1–5
105	Fulp E W, Fink G A, Haack J N. Predicting computer system failures using support vector machines. In: Proceedings of the 1st USENIX Conference on Analysis of System Logs. 2008, 5
106	Ganguly S, Consul A, Khan A, Bussone B, Richards J, Miguel A. A practical approach to hard disk failure prediction in cloud platforms: big data model for failure management in datacenters. In: Proceedings of the 2nd International Conference on Big Data Computing Service and Applications. 2016, 105–116
107	Krammer B, Bidmon K, Müller M S, Resch M M. MARMOT: an MPI analysis and checking tool. Advances in Parallel Computing, 2004, 13: 493–500
108	Vetter J S, De Supinski B R. Dynamic software testing of MPI applications with Umpire. In: Proceedings of 2000 ACM/IEEE Conference on Supercomputing. 2000, 51
109	Gao J, Yu K, Qing P. A scalable runtime fault detection mechanism for high performance computing. In: Proceedings of the 2nd Information Technology, Networking, Electronic and Automation Control Conference. 2017, 490–495
110	Kharbas K, Kim D, Hoefler T, Mueller F. Assessing HPC failure detectors for MPI jobs. In: Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing. 2012, 81–88
111	Liang Y, Zhang Y, Sivasubramaniam A, Jette M, Sahoo R. BlueGene/L failure analysis and prediction models. In: Proceedings of the International Conference on Dependable Systems and Networks. 2006, 425–434
112	Gainaru A, Cappello F, Snir M, Kramer W. Fault prediction under the microscope: a closer look into HPC systems. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 2012, 1–11
113	Gainaru A, Cappello F, Kramer W. Taming of the shrew: modeling the normal and faulty behaviour of large-scale HPC systems. In: Proceedings of the 26th International Parallel and Distributed Processing Symposium. 2012, 1168–1179
114	Pelaez A, Quiroz A, Browne J C, Chuah E, Parashar M. Online failure prediction for HPC resources using decentralized clustering. In: Proceedings of the 21st International Conference on High Performance Computing. 2014, 1–9
115	Gunawi H S, Suminto R O, Sears R, Golliher C, Sundararaman S, et al. Fail-slow at scale: evidence of hardware performance faults in large production systems. In: Proceedings of the 16th USENIX Conference on File and Storage Technologies. 2018, 1–14


