Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci. 2023, Vol. 17, Issue 4: 174105. https://doi.org/10.1007/s11704-022-2096-3
REVIEW ARTICLE
Software approaches for resilience of high performance computing systems: a survey
Jie JIA (1, 2), Yi LIU (1, 2), Guozhen ZHANG (1, 2), Yulin GAO (1, 2), Depei QIAN (1, 2)
1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
2. Sino-German Joint Software Institute, Beihang University, Beijing 100191, China
Abstract

As high-performance computing (HPC) systems have scaled up in recent years, their reliability has declined steadily. System resilience is therefore regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. We first present a classification of software resilience approaches and then introduce the major approaches and techniques, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. Finally, we discuss the challenges posed by system scale and heterogeneous architectures.

Keywords: resilience, high-performance computing, fault tolerance, challenge
Corresponding Author(s): Jie JIA   
Just Accepted Date: 07 September 2022   Issue Date: 12 December 2022
 Cite this article:   
Jie JIA, Yi LIU, Guozhen ZHANG, et al. Software approaches for resilience of high performance computing systems: a survey[J]. Front. Comput. Sci., 2023, 17(4): 174105.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-2096-3
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I4/174105
| System | MTBF (hours) | Cores | Nodes |
|---|---|---|---|
| Jaguar XT4 | 36.91 | 31,328 | 7,832 |
| Jaguar XT5 | 22.67 | 149,504 | 18,688 |
| Jaguar XK5 | 8.93 | 298,592 | 18,688 |
| Titan XK7 | 14.51 | 560,640 | 18,688 |
Tab.1  The MTBF of different HPC systems [7]
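To make the system-level figures in Tab.1 easier to interpret, the short sketch below converts them into an implied per-node MTBF. It rests on an assumption that is not stated in the source: node failures are independent and exponentially distributed, so failure rates add up across nodes. The function names are hypothetical and chosen only for this illustration.

```python
# System-level MTBF (hours) and node counts, copied from Tab.1.
SYSTEMS = {
    "Jaguar XT4": (36.91, 7832),
    "Jaguar XT5": (22.67, 18688),
    "Jaguar XK5": (8.93, 18688),
    "Titan XK7": (14.51, 18688),
}


def node_mtbf_hours(system_mtbf, nodes):
    """Per-node MTBF implied by a system-level MTBF.

    Assumption (not from the source): node failures are independent and
    exponentially distributed, so system_mtbf = node_mtbf / nodes.
    """
    return system_mtbf * nodes


def system_mtbf_hours(node_mtbf, nodes):
    """System-level MTBF for a machine built from `nodes` identical nodes."""
    return node_mtbf / nodes


if __name__ == "__main__":
    for name, (mtbf, nodes) in SYSTEMS.items():
        per_node_years = node_mtbf_hours(mtbf, nodes) / (24 * 365)
        print(f"{name}: roughly {per_node_years:.1f} years between failures per node")
```

Under the same assumption, `system_mtbf_hours` shows why scale hurts: keeping node quality fixed and multiplying the node count by ten divides the system MTBF by ten.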
| Concept | Explanation |
|---|---|
| Fault | Root cause of an error, usually a physical defect or a software bug |
| Error | Deviation from the expected result |
| Failure | The system fails to deliver correct service |
Tab.2  The terminology of HPC malfunctions
| Class | Meaning | Typical examples | Explanation |
|---|---|---|---|
| Fail-stop failure | Hardware and/or software stop working | Kernel panic | Kernel error from which the operating system cannot quickly recover. |
| | | Node heartbeat fault | Exception when receiving the heartbeat from other nodes. |
| | | Traps | Segmentation faults, invalid-opcode traps. |
| | | GFS failure | Failure of the global file system. |
| | | Scheduler | Internal bugs of the job scheduler. |
| | | Acc failure | Failure of accelerators or co-processors. |
| | | Storage failure | Storage system fails to work. |
| | | Node hardware failure | Node fails due to power or cooling-system errors, damaged hardware components, etc. |
| | | Interconnect congestion | Network connection is congested. |
| Soft error / Fail-continue error | System still works, but the application executes incorrectly | SDC | Undetected silent data corruption. |
| | | CFE | Control flow error. |
| | | MCE | Memory check exception. |
Tab.3  Abnormal states of HPC systems
| Resilience method | Checkpointing | Replication | Soft error resilience | ABFT | Fault detection and prediction |
|---|---|---|---|---|---|
| Redundancy data | System memory or application data space | Process data and message | N/A | Checksum of algorithm | N/A |
| Recovery method | Failure-rollback | Forward recovery | Error-restart | Error-restart | N/A |
| Overhead/cost | Medium | High | Medium | Low | Low |
| Generality | Systems and applications | Systems and applications | Systems and applications | Applications | Systems and applications |
| Ease of use or deployment | Easy | Easy | Hard | Hard | Medium |
| Limitation | Scalability | Resource consumption and scalability | Soft error only | Algorithm-dependent | Rely on other recovery methods |
Tab.4  Classification of typical resilience approaches
| Checkpointing level | System-level | User-level | Application-level |
|---|---|---|---|
| Explanation | The operating system is in charge of checkpointing. | A user-level library that is linked to the application is responsible for checkpointing. | The application itself is in charge of checkpointing. |
| Typical systems | BLCR [21] | DMTCP [22] | FTI [23] |
| Checkpointing data | Status of the entire system | Status of the entire application | User-specified application status |
| Overhead | High | Medium | Low |
| Transparency | Transparent to applications | Application needs to be loaded or linked with the checkpoint library | Application needs to be modified |
| Portability | Low | Medium | High |
Tab.5  Comparison of different checkpointing levels
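The application-level column of Tab.5 can be illustrated with a minimal sketch in which the program itself decides what state to save and when. This is not the API of FTI or any other system listed in the table: the function names, the pickle file format, and the `fail_at` switch used to simulate a fail-stop failure are all assumptions made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write `state` atomically: dump to a temp file, then rename over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical, for demonstration only) raises an exception
    just before the given step to mimic a fail-stop failure.
    """
    # Only user-specified state is saved: the loop counter and the accumulator.
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fail-stop failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` a second time with the same path resumes from the last checkpoint rather than from step 0, which is the failure-rollback behaviour; the work done between the last checkpoint and the failure is recomputed.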
| Approach | Advantages | Disadvantages |
|---|---|---|
| Checkpoint/restart | No hardware features required; little or no program modification | Requires large storage space and incurs high time overhead |
| Replication | Simple and straightforward | High overhead in running time and computing resources |
| ABFT | Low overhead | Requires program code modifications; poor portability |
Tab.6  Software solutions for SDC challenge
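As an illustration of the ABFT row, the sketch below follows the classic checksum idea for matrix multiplication described by Huang and Abraham [79]: a column-checksum row is appended to A and a row-checksum column to B, so that their product carries checksums that expose a single corrupted element. It is a simplified sketch, not the implementation of any surveyed system. The function names are hypothetical, and integer matrices are used so that checksum comparisons are exact; floating-point data would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append a row that holds the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of `b` the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full):
    """Return (i, j) of a single corrupted data element, or None if all checksums hold."""
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single corrupted data element")


def correct_error(c_full):
    """Repair a single corrupted data element in place using its row checksum."""
    pos = locate_error(c_full)
    if pos is not None:
        i, j = pos
        m = len(c_full[0]) - 1
        others = sum(c_full[i][:m]) - c_full[i][j]
        c_full[i][j] = c_full[i][m] - others
    return pos
```

A typical use is `c_full = matmul(with_column_checksum(a), with_row_checksum(b))`, followed by `correct_error(c_full)` after the computation. The extra cost is one additional row and column, which is why Tab.6 lists ABFT as low overhead, while the dependence on the algebraic structure of the computation explains its poor portability.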
1 Dongarra J. Report on the Fujitsu Fugaku system. University of Tennessee-Knoxville Innovative Computing Laboratory, Tech. Rep. ICLUT-20-06, 2020
2 Di Martino C, Kramer W, Kalbarczyk Z, Iyer R. Measuring and understanding extreme-scale application resilience: a field study of 5,000,000 HPC application runs. In: Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2015, 25–36
3 Hursey J, Squyres J M, Mattox T I, Lumsdaine A. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In: Proceedings of 2007 IEEE International Parallel and Distributed Processing Symposium. 2007, 1–8
4 Cappello F, Geist A, Gropp B, Kale L, Kramer B, Snir M. Toward exascale resilience. The International Journal of High Performance Computing Applications, 2009, 23(4): 374–388
5 Egwutuoha I P, Levy D, Selic B, Chen S. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. The Journal of Supercomputing, 2013, 65(3): 1302–1326
6 Bergman K, Borkar S, Campbell D, Carlson W, Dally W, Denneau M, Franzon P, Harrod W, Hill K, Hiller J, et al. Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep, 2008, 15: 181
7 Gupta S, Patel T, Engelmann C, Tiwari D. Failures in large scale systems: Long-term measurement, analysis, and implications. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2017, 44
8 Radojkovic P, Marazakis M, Carpenter P, Jeyapaul R, Gizopoulos D, Schulz M, Armejach A, Ayguade E A, Bodin F, Canal R, et al. Towards resilient EU HPC systems: A blueprint. PhD thesis, European HPC resilience initiative, 2020
9 Avizienis A, Laprie J C, Randell B, Landwehr C. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 2004, 1(1): 11–33
10 Mukherjee S. Architecture Design for Soft Errors. San Francisco: Morgan Kaufmann, 2008
11 Tan L, DeBardeleben N. Failure analysis and quantification for contemporary and future supercomputers. 2019, arXiv preprint arXiv: 1911.02118
12 Shoji F, Matsui S, Okamoto M, Sueyasu F, Tsukamoto T, Uno A, Yamamoto K. Long term failure analysis of 10 peta-scale supercomputer. In: Proceedings of HPC in Asia Session at ISC 2015. 2015
13 Das A, Mueller F, Siegel C, Vishnu A. Desh: deep learning for system health prediction of lead times to failure in HPC. In: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing. 2018, 40–51
14 Di Martino C, Kalbarczyk Z, Iyer R K, Baccanico F, Fullop J, Kramer W. Lessons learned from the analysis of system failures at petascale: the case of Blue Waters. In: Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2014, 610–621
15 El-Sayed N, Schroeder B. Reading between the lines of failure logs: understanding how HPC systems fail. In: Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2013, 1–12
16 Bode B, Butler M, Dunning T, Hoefler T, Kramer W, Gropp W, Hwu W M. The Blue Waters super-system for super-science. In: Contemporary High Performance Computing: From Petascale toward Exascale, 339–366. Chapman and Hall/CRC, 2013
17 Bland B. Titan - Early experience with the Titan system at Oak Ridge National Laboratory. In: Proceedings of 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. 2012, 2189–2211
18 Bautista-Gomez L, Gainaru A, Perarnau S, Tiwari D, Gupta S, Engelmann C, Cappello F, Snir M. Reducing waste in extreme scale systems through introspective analysis. In: Proceedings of 2016 IEEE International Parallel and Distributed Processing Symposium. 2016, 212–221
19 Tiwari D, Gupta S, Vazhkudai S S. Lazy checkpointing: exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems. In: Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2014, 25–36
20 Tiwari D, Gupta S, Gallarno G, Rogers J, Maxwell D. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2015, 1–12
21 Hargrove P H, Duell J C. Berkeley lab checkpoint/restart (BLCR) for Linux clusters. Journal of Physics: Conference Series, 2006, 46: 494–499
22 Ansel J, Arya K, Cooperman G. DMTCP: Transparent checkpointing for cluster computations and the desktop. In: Proceedings of 2009 IEEE International Symposium on Parallel & Distributed Processing. 2009, 1–12
23 Bautista-Gomez L, Tsuboi S, Komatitsch D, Cappello F, Maruyama N, Matsuoka S. FTI: High performance fault tolerance interface for hybrid systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. 2011, 1–12
24 Zhong H, Nieh J. Crak: Linux checkpoint/restart as a kernel module. Technical Report, Citeseer, 2001
25 Osman S, Subhraveti D, Su G, Nieh J. The design and implementation of Zap: a system for migrating computing environments. ACM SIGOPS Operating Systems Review, 2002, 36(S1): 361–376
26 Sankaran S, Squyres J M, Barrett B, Sahay V, Lumsdaine A, Duell J, Hargrove P, Roman E. The LAM/MPI checkpoint/restart framework: system-initiated checkpointing. The International Journal of High Performance Computing Applications, 2005, 19(4): 479–493
27 Wang C, Mueller F, Engelmann C, Scott S L. Hybrid checkpointing for MPI jobs in HPC environments. In: Proceedings of the 16th International Conference on Parallel and Distributed Systems. 2010, 524–533
28 Sancho J C, Petrini F, Johnson G, Frachtenberg E. On the feasibility of incremental checkpointing for scientific computing. In: Proceedings of the 18th International Parallel and Distributed Processing Symposium. 2004, 58
29 Agarwal S, Garg R, Gupta M S, Moreira J E. Adaptive incremental checkpointing for massively parallel systems. In: Proceedings of the 18th Annual International Conference on Supercomputing. 2004, 277–286
30 Bosilca G, Bouteiller A, Cappello F, Djilali S, Fedak G, Germain C, Herault T, Lemarinier P, Lodygensky O, Magniette F, Neri V, Selikhov A. MPICH-V: toward a scalable fault tolerant MPI for volatile nodes. In: Proceedings of 2002 ACM/IEEE Conference on Supercomputing. 2002, 29
31 Bronevetsky G, Marques D, Pingali K, Stodghill P. Automated application-level checkpointing of MPI programs. In: Proceedings of the 9th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2003, 84–94
32 Graham R L, Choi S E, Daniel D J, Desai N N, Minnich R G, Rasmussen C E, Risinger L D, Sukalski M W. A network-failure-tolerant message-passing system for terascale clusters. International Journal of Parallel Programming, 2003, 31(4): 285–303
33 Woo N, Choi S, Jung H, Moon J, Yeom H Y, Park T, Park H. MPICH-GF: providing fault tolerance on grid environments. In: Proceedings of the 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid2003), the Poster and Research Demo Session. 2003
34 Zheng G, Shi L, Kale L V. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In: Proceedings of 2004 IEEE International Conference on Cluster Computing. 2004, 93–103
35 Zhang Y, Wong D, Zheng W. User-level checkpoint and recovery for LAM/MPI. ACM SIGOPS Operating Systems Review, 2005, 39(3): 72–81
36 Buntinas D, Coti C, Herault T, Lemarinier P, Pilard L, Rezmerita A, Rodriguez E, Cappello F. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols. Future Generation Computer Systems, 2008, 24(1): 73–84
37 Ruscio J F, Heffner M A, Varadarajan S. DejaVu: transparent user-level checkpointing, migration, and recovery for distributed systems. In: Proceedings of 2007 IEEE International Parallel and Distributed Processing Symposium. 2007, 1–10
38 Cao J, Arya K, Garg R, Matott S, Panda D K, Subramoni H, Vienne J, Cooperman G. System-level scalable checkpoint-restart for petascale computing. In: Proceedings of the 22nd International Conference on Parallel and Distributed Systems. 2016, 932–941
39 Garg R, Price G, Cooperman G. MANA for MPI: MPI-agnostic network-agnostic transparent checkpointing. In: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. 2019, 49–60
40 Laguna I, Richards D F, Gamblin T, Schulz M, De Supinski B R, Mohror K, Pritchard H. Evaluating and extending user-level fault tolerance in MPI applications. The International Journal of High Performance Computing Applications, 2016, 30(3): 305–319
41 Chakraborty S, Laguna I, Emani M, Mohror K, Panda D K, Schulz M, Subramoni H. EREINIT: scalable and efficient fault-tolerance for bulk-synchronous MPI applications. Concurrency and Computation: Practice and Experience, 2020, 32(3): e4863
42 Georgakoudis G, Guo L, Laguna I. Reinit++: evaluating the performance of global-restart recovery methods for MPI fault tolerance. In: Proceedings of the 35th International Conference on High Performance Computing. 2020, 536–554
43 Bronevetsky G, Marques D J, Pingali K K, Rugina R, McKee S A. Compiler-enhanced incremental checkpointing for OpenMP applications. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2008, 275–276
44 Arora R, Bangalore P, Mernik M. A technique for non-invasive application-level checkpointing. The Journal of Supercomputing, 2011, 57(3): 227–255
45 Ba T N, Arora R. A tool for semi-automatic application-level checkpointing. In: Technical Posters at the International Conference for High Performance Computing, Networking, Storage and Analysis. 2016, 16–20
46 Quinlan D, Liao C. The ROSE source-to-source compiler infrastructure. In: Proceedings of the Cetus Users and Compiler Infrastructure Workshop. 2011, 1–3
47 Shahzad F, Thies J, Kreutzer M, Zeiser T, Hager G, Wellein G. CRAFT: a library for easier application-level checkpoint/restart and automatic fault tolerance. IEEE Transactions on Parallel and Distributed Systems, 2019, 30(3): 501–514
48 Takizawa H, Sato K, Komatsu K, Kobayashi H. CheCUDA: a checkpoint/restart tool for CUDA applications. In: Proceedings of 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies. 2009, 408–413
49 Garg R. Extending the domain of transparent checkpoint-restart for large-scale HPC. Northeastern University, Dissertation, 2019
50 Garg R, Mohan A, Sullivan M, Cooperman G. CRUM: checkpoint-restart support for CUDA’s unified memory. In: Proceedings of 2018 IEEE International Conference on Cluster Computing. 2018, 302–313
51 Jain T, Cooperman G. CRAC: Checkpoint-restart architecture for CUDA with streams and UVM. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis. 2020, 1–15
52 Lee K, Sullivan M B, Hari S K S, Tsai T, Keckler S W, Erez M. GPU snapshot: checkpoint offloading for GPU-dense systems. In: Proceedings of the ACM International Conference on Supercomputing. 2019, 171–183
53 Kannan S, Farooqui N, Gavrilovska A, Schwan K. HeteroCheckpoint: efficient checkpointing for accelerator-based systems. In: Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2014, 738–743
54 Vaidya N H. A case for two-level distributed recovery schemes. In: Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems. 1995, 64–73
55 Haines J, Lakamraju V, Koren I, Krishna C M. Application-level fault tolerance as a complement to system-level fault tolerance. The Journal of Supercomputing, 2000, 16(1–2): 53–68
56 Di S, Robert Y, Vivien F, Cappello F. Toward an optimal online checkpoint solution under a two-level HPC checkpoint model. IEEE Transactions on Parallel and Distributed Systems, 2017, 28(1): 244–259
57 Benoit A, Cavelan A, Le Fèvre V, Robert Y, Sun H. Towards optimal multi-level checkpointing. IEEE Transactions on Computers, 2017, 66(7): 1212–1226
58 Ferreira K, Stearley J, Laros J H, Oldfield R, Pedretti K, Brightwell R, Riesen R, Bridges P G, Arnold D. Evaluating the viability of process replication reliability for exascale systems. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. 2011, 1–12
59 Wu P, Ding C, Chen L, Gao F, Davies T, Karlsson C, Chen Z. Fault tolerant matrix-matrix multiplication: Correcting soft errors on-line. In: Proceedings of the 2nd Workshop on Scalable Algorithms for Large-Scale Systems. 2011, 25–28
60 Fiala D, Mueller F, Engelmann C, Riesen R, Ferreira K, Brightwell R. Detection and correction of silent data corruption for large-scale high-performance computing. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 2012, 1–12
61 Wang Z, Yang X, Zhou Y. MMPI: a scalable fault tolerance mechanism for MPI large scale parallel computing. In: Proceedings of the 10th IEEE International Conference on Computer and Information Technology. 2010, 1251–1256
62 Hussain Z, Znati T, Melhem R. Partial redundancy in HPC systems with non-uniform node reliabilities. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis. 2018, 566–576
63 Elliott J, Kharbas K, Fiala D, Mueller F, Ferreira K, Engelmann C. Combining partial redundancy and checkpointing for HPC. In: Proceedings of the 32nd International Conference on Distributed Computing Systems. 2012, 615–626
64 George C, Vadhiyar S. Fault tolerance on large scale systems using adaptive process replication. IEEE Transactions on Computers, 2015, 64(8): 2213–2225
65 Quinn H, Graham P. Terrestrial-based radiation upsets: a cautionary tale. In: Proceedings of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. 2005, 193–202
66 Schroeder B, Pinheiro E, Weber W D. DRAM errors in the wild: a large-scale field study. Communications of the ACM, 2011, 54(2): 100–107
67 Sedaghat Y, Miremadi S G, Fazeli M. A software-based error detection technique using encoded signatures. In: Proceedings of the 21st IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems. 2006, 389–400
68 Miremadi G, Karlsson J, Gunneflo U, Torin J. Two software techniques for on-line error detection. In: Proceedings of the 22nd International Symposium on Fault-Tolerant Computing. 1992, 328–335
69 Vemu R, Abraham J. CEDA: control-flow error detection using assertions. IEEE Transactions on Computers, 2011, 60(9): 1233–1245
70 Zarandi H R, Maghsoudloo M, Khoshavi N. Two efficient software techniques to detect and correct control-flow errors. In: Proceedings of the 16th Pacific Rim International Symposium on Dependable Computing. 2010, 141–148
71 Gomez L B, Cappello F. Detecting silent data corruption through data dynamic monitoring for scientific applications. ACM SIGPLAN Notices, 2014, 49(8): 381–382
72 Berrocal E, Bautista-Gomez L, Di S, Lan Z, Cappello F. Lightweight silent data corruption detection based on runtime data analysis for HPC applications. In: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing. 2015, 275–278
73 LeBlanc T, Anand R, Gabriel E, Subhlok J. VolpexMPI: an MPI library for execution of parallel applications on volatile nodes. In: Proceedings of the 16th European Parallel Virtual Machine / Message Passing Interface Users’ Group Meeting. 2009, 124–133
74 Engelmann C, Boehm S. Redundant execution of HPC applications with MR-MPI. In: Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks. 2011, 31–38
75 Berrocal E, Bautista-Gomez L, Di S, Lan Z, Cappello F. Toward general software level silent data corruption detection for parallel applications. IEEE Transactions on Parallel and Distributed Systems, 2017, 28(12): 3642–3655
76 Fiala D, Ferreira K B, Mueller F, Engelmann C. A tunable, software-based DRAM error detection and correction library for HPC. In: Proceedings of European Conference on Parallel Processing. 2012, 251–261
77 Fiala D, Mueller F, Ferreira K B. FlipSphere: a software-based DRAM error detection and correction library for HPC. In: Proceedings of the 20th International Symposium on Distributed Simulation and Real Time Applications. 2016, 19–28
78 Fiala D, Mueller F, Ferreira K, Engelmann C. Mini-Ckpts: surviving OS failures in persistent memory. In: Proceedings of 2016 International Conference on Supercomputing. 2016, 7
79 Huang K H, Abraham J A. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 1984, C-33(6): 518–528
80 Luk F T, Park H. Fault-tolerant matrix triangularizations on systolic arrays. IEEE Transactions on Computers, 1988, 37(11): 1434–1438
81 Luk F T, Park H. An analysis of algorithm-based fault tolerance techniques. Journal of Parallel and Distributed Computing, 1988, 5(2): 172–184
82 Bouteiller A, Herault T, Bosilca G, Du P, Dongarra J. Algorithm-based fault tolerance for dense matrix factorizations, multiple failures and accuracy. ACM Transactions on Parallel Computing, 2015, 1(2): 10
83 Chen Z. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods. ACM SIGPLAN Notices, 2013, 48(8): 167–176
84 Tao D, Song S L, Krishnamoorthy S, Wu P, Liang X, Zhang E Z, Kerbyson D, Chen Z. New-sum: a novel online ABFT scheme for general iterative methods. In: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing. 2016, 43–55
85 Schöll A, Braun C, Kochte M A, Wunderlich H J. Efficient algorithm-based fault tolerance for sparse matrix operations. In: Proceedings of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2016, 251–262
86 Shantharam M, Srinivasmurthy S, Raghavan P. Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In: Proceedings of the 26th ACM International Conference on Supercomputing. 2012, 69–78
87 Zhu Y, Liu Y, Li M, Qian D. Block-checksum-based fault tolerance for matrix multiplication on large-scale parallel systems. In: Proceedings of the 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems. 2018, 172–179
88 Zhu Y, Liu Y, Zhang G. FT-PBLAS: PBLAS-based fault-tolerant linear algebra computation on high-performance computing systems. IEEE Access, 2020, 8: 42674–42688
89 Chen Z, Dongarra J. Algorithm-based fault tolerance for fail-stop failures. IEEE Transactions on Parallel and Distributed Systems, 2008, 19(12): 1628–1641
90 Roche T, Cunche M, Roch J L. Algorithm-based fault tolerance applied to P2P computing networks. In: Proceedings of the 1st International Conference on Advances in P2P Systems. 2009, 144–149
91 Hakkarinen D, Wu P, Chen Z. Fail-stop failure algorithm-based fault tolerance for Cholesky decomposition. IEEE Transactions on Parallel and Distributed Systems, 2015, 26(5): 1323–1335
92 Davies T, Karlsson C, Liu H, Ding C, Chen Z. High performance Linpack benchmark: a fault tolerant implementation without checkpointing. In: Proceedings of the International Conference on Supercomputing. 2011, 162–171
93 Chen J, Li S, Chen Z. GPU-ABFT: optimizing algorithm-based fault tolerance for heterogeneous systems with GPUs. In: Proceedings of 2016 IEEE International Conference on Networking, Architecture and Storage. 2016, 1–2
94 Chen J, Li H, Li S, Liang X, Wu P, Tao D, Ouyang K, Liu Y, Zhao K, Guan Q, Chen Z. Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis. 2018, 854–865
95 Braun C, Halder S, Wunderlich H J. A-ABFT: autonomous algorithm-based fault tolerance for matrix multiplications on graphics processing units. In: Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. 2014, 443–454
96 Ranganathan S, George A D, Todd R W, Chidester M C. Gossip-style failure detection and distributed consensus for scalable heterogeneous clusters. Cluster Computing, 2001, 4(3): 197–209
97 Gabel M, Schuster A, Bachrach R G, Bjørner N. Latent fault detection in large scale services. In: Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks. 2012, 1–12
98 Wu L, Luo H, Zhan J, Meng D. A runtime fault detection method for HPC cluster. In: Proceedings of the 12th International Conference on Parallel and Distributed Computing, Applications and Technologies. 2011, 68–72
99 Ghiasvand S, Ciorba F M. Anomaly detection in high performance computers: a vicinity perspective. In: Proceedings of the 18th International Symposium on Parallel and Distributed Computing. 2019, 112–120
100 Egwutuoha I P, Chen S, Levy D, Selic B, Calvo R. Cost-oriented proactive fault tolerance approach to high performance computing (HPC) in the cloud. International Journal of Parallel, Emergent and Distributed Systems, 2014, 29(4): 363–378
101 Borghesi A, Libri A, Benini L, Bartolini A. Online anomaly detection in HPC systems. In: Proceedings of 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems. 2019, 229–233
102 Borghesi A, Molan M, Milano M, Bartolini A. Anomaly detection and anticipation in high performance computing systems. IEEE Transactions on Parallel and Distributed Systems, 2022, 33(4): 739–750
103 Dani M C, Doreau H, Alt S. K-means application for anomaly detection and log classification in HPC. In: Proceedings of the 30th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. 2017, 201–210
104 Zhu B, Wang G, Liu X, Hu D, Lin S, Ma J. Proactive drive failure prediction for large scale storage systems. In: Proceedings of the 29th Symposium on Mass Storage Systems and Technologies. 2013, 1–5
105 Fulp E W, Fink G A, Haack J N. Predicting computer system failures using support vector machines. In: Proceedings of the 1st USENIX Conference on Analysis of System Logs. 2008, 5
106 Ganguly S, Consul A, Khan A, Bussone B, Richards J, Miguel A. A practical approach to hard disk failure prediction in cloud platforms: big data model for failure management in datacenters. In: Proceedings of the 2nd International Conference on Big Data Computing Service and Applications. 2016, 105–116
107 Krammer B, Bidmon K, Müller M S, Resch M M. MARMOT: an MPI analysis and checking tool. Advances in Parallel Computing, 2004, 13: 493–500
108 Vetter J S, De Supinski B R. Dynamic software testing of MPI applications with Umpire. In: Proceedings of 2000 ACM/IEEE Conference on Supercomputing. 2000, 51
109 Gao J, Yu K, Qing P. A scalable runtime fault detection mechanism for high performance computing. In: Proceedings of the 2nd Information Technology, Networking, Electronic and Automation Control Conference. 2017, 490–495
110 Kharbas K, Kim D, Hoefler T, Mueller F. Assessing HPC failure detectors for MPI jobs. In: Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing. 2012, 81–88
111 Liang Y, Zhang Y, Sivasubramaniam A, Jette M, Sahoo R. BlueGene/L failure analysis and prediction models. In: Proceedings of the International Conference on Dependable Systems and Networks. 2006, 425–434
112 Gainaru A, Cappello F, Snir M, Kramer W. Fault prediction under the microscope: a closer look into HPC systems. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 2012, 1–11
113 Gainaru A, Cappello F, Kramer W. Taming of the shrew: modeling the normal and faulty behaviour of large-scale HPC systems. In: Proceedings of the 26th International Parallel and Distributed Processing Symposium. 2012, 1168–1179
114 Pelaez A, Quiroz A, Browne J C, Chuah E, Parashar M. Online failure prediction for HPC resources using decentralized clustering. In: Proceedings of the 21st International Conference on High Performance Computing. 2014, 1–9
115 Gunawi H S, Suminto R O, Sears R, Golliher C, Sundararaman S, et al. Fail-slow at scale: evidence of hardware performance faults in large production systems. In: Proceedings of the 16th USENIX Conference on File and Storage Technologies. 2018, 1–14