Software approaches for resilience of high performance computing systems: a survey
Jie JIA1,2, Yi LIU1,2, Guozhen ZHANG1,2, Yulin GAO1,2, Depei QIAN1,2
1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China 2. Sino-German Joint Software Institute, Beihang University, Beijing 100191, China
Abstract With the scaling up of high-performance computing (HPC) systems in recent years, their reliability has been declining continuously. System resilience is therefore regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. We first present a classification of software resilience approaches and then introduce the major approaches and techniques, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. The challenges posed by system scale and heterogeneous architectures are also discussed.
high-performance computing
fault tolerance
Just Accepted Date: 07 September 2022
Issue Date: 12 December 2022