Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2014, Vol. 8 Issue (3) : 378-390    https://doi.org/10.1007/s11704-014-3503-1
RESEARCH ARTICLE
Iaso: an autonomous fault-tolerant management system for supercomputers
Kai LU1,2,*, Xiaoping WANG1,2, Gen LI2, Ruibo WANG2, Wanqing CHI2, Yongpeng LIU2, Hongwei TANG2, Hua FENG2, Yinghui GAO3
1. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China
2. College of Computer, National University of Defense Technology, Changsha 410073, China
3. ATR Laboratory, National University of Defense Technology, Changsha 410073, China
Abstract

As system scale grows, the inherent reliability of supercomputers declines. The cost of fault handling and task recovery increases so rapidly that reliability will soon limit the usability of supercomputers. This issue, referred to as the “reliability wall”, is regarded as a critical problem for current and future supercomputers. To address it, we propose an autonomous fault-tolerant system, named Iaso, for the MilkyWay-2 system. Iaso introduces the concept of autonomous management to supercomputers: the computer itself, rather than human operators, takes charge of fault management. Iaso automatically manages the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the MilkyWay-2 system with autonomous features such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers drops from several hours to a few seconds. Iaso greatly improves the usability and reliability of the MilkyWay-2 system.
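The four lifecycle stages named in the abstract (fault detection, fault diagnosis, fault isolation, and task recovery) can be illustrated with a minimal sketch. This is not Iaso's implementation, which the abstract does not describe. All names here (`Fault`, `Stage`, `handle_fault`, the symptom-to-cause rule table, and the round-robin task reassignment) are hypothetical and chosen only to show how the stages chain together without human intervention.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """Lifecycle stages listed in the abstract."""
    DETECTED = "detected"
    DIAGNOSED = "diagnosed"
    ISOLATED = "isolated"
    RECOVERED = "recovered"


@dataclass
class Fault:
    # Hypothetical record; the real system's data model is not given.
    node: str
    symptom: str
    cause: str = "unknown"
    history: list = field(default_factory=list)


def detect(node, symptom):
    """Create a fault record when a symptom is observed on a node."""
    fault = Fault(node, symptom)
    fault.history.append(Stage.DETECTED)
    return fault


def diagnose(fault, rules):
    """Map the symptom to a cause using an assumed lookup table."""
    fault.cause = rules.get(fault.symptom, "unknown")
    fault.history.append(Stage.DIAGNOSED)
    return fault


def isolate(fault, healthy_nodes):
    """Remove the faulty node from the pool of schedulable nodes."""
    healthy_nodes.discard(fault.node)
    fault.history.append(Stage.ISOLATED)
    return fault


def recover(fault, tasks, healthy_nodes):
    """Move tasks off the faulty node, round-robin over healthy nodes."""
    spare = sorted(healthy_nodes)
    if not spare:
        raise RuntimeError("no healthy node available for recovery")
    moved = 0
    for task, node in tasks.items():
        if node == fault.node:
            tasks[task] = spare[moved % len(spare)]
            moved += 1
    fault.history.append(Stage.RECOVERED)
    return fault


def handle_fault(node, symptom, rules, tasks, healthy_nodes):
    """Run the whole lifecycle for one observed symptom."""
    fault = detect(node, symptom)
    diagnose(fault, rules)
    isolate(fault, healthy_nodes)
    recover(fault, tasks, healthy_nodes)
    return fault
```

Calling `handle_fault("n2", "ecc_error", {"ecc_error": "memory"}, tasks, nodes)` would, under these assumptions, mark `n2` as unschedulable and reassign its tasks to the remaining nodes in one pass.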

Keywords: supercomputer, autonomous management, fault tolerant, fault management, MilkyWay-2 system
Corresponding Author(s): Kai LU   
Issue Date: 24 June 2014
 Cite this article:   
Kai LU,Xiaoping WANG,Gen LI, et al. Iaso: an autonomous fault-tolerant management system for supercomputers[J]. Front. Comput. Sci., 2014, 8(3): 378-390.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-014-3503-1
https://academic.hep.com.cn/fcs/EN/Y2014/V8/I3/378