Frontiers of Computer Science

Front. Comput. Sci.    2022, Vol. 16 Issue (5) : 165107    https://doi.org/10.1007/s11704-022-0625-8
REVIEW ARTICLE
Prediction of job characteristics for intelligent resource allocation in HPC systems: a survey and future directions
Zhengxiong HOU1(), Hong SHEN2, Xingshe ZHOU1, Jianhua GU1, Yunlan WANG1, Tianhai ZHAO1
1. Center for High Performance Computing, School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China
2. School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou 510275, China
Abstract

Nowadays, high-performance computing (HPC) clusters are increasingly popular, and large volumes of job logs recording many years of operation traces have been accumulated. At the same time, the HPC cloud makes it possible to access HPC services remotely. To execute their applications, both HPC end-users and cloud users need to request specific resources for different workloads by themselves. Because users are usually not familiar with the hardware details, software layers, and performance behavior of the underlying HPC systems, it is hard for them to select optimal resource configurations in terms of performance, cost, and energy efficiency. Hence, how to provide on-demand services with intelligent resource allocation is a critical issue in the HPC community. Prediction of job characteristics plays a key role in intelligent resource allocation. This paper presents a survey of the existing work and future directions for the prediction of job characteristics for intelligent resource allocation in HPC systems. We first review existing techniques for obtaining the performance and energy consumption data of jobs. Then we survey techniques for single-objective predictions of runtime, queue time, power and energy consumption, cost, and optimal resource configuration for input jobs, as well as multi-objective predictions. We conclude after discussing future trends, research challenges, and possible solutions towards intelligent resource allocation in HPC systems.

Keywords: high-performance computing; performance prediction; job characteristics; intelligent resource allocation; cloud computing; machine learning
Corresponding Author(s): Zhengxiong HOU   
Just Accepted Date: 07 January 2022   Issue Date: 19 May 2022
 Cite this article:   
Zhengxiong HOU, Hong SHEN, Xingshe ZHOU, et al. Prediction of job characteristics for intelligent resource allocation in HPC systems: a survey and future directions[J]. Front. Comput. Sci., 2022, 16(5): 165107.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-0625-8
https://academic.hep.com.cn/fcs/EN/Y2022/V16/I5/165107
Fig.1  Resource allocation for typical jobs in HPC systems
| Objectives | User | Provider |
| Performance/makespan/job throughput | Yes | Yes |
| Fairness | Yes | Yes |
| Resilience/fault-tolerance | Yes | Yes |
| Cost | Yes | |
| Power/energy/thermal/carbon emission | | Yes |
| Resource utilization/load balance | | Yes |
| Profit | | Yes |
Tab.1  Optimization objectives of intelligent resource allocation in HPC systems
Fig.2  Prediction of job characteristics for intelligent resource allocation in HPC systems
| Typical systems | Resource monitoring | Job logs (difference from SWF [1]) | Open source or commercial |
| HTCondor (high-throughput Condor) | Hostname, OS, architecture, state, activity, average load, memory, activity time | Resource name and ID, keyword | Open source |
| LSF (IBM) | Hostname, status, r15s, r1m, r15m, ut, pg, ls, it, tmp, swp, mem, io | Project, command, work directory, submit host, output file, error file, execute host, job name | Commercial |
| SLURM | Partition, status, time limit, # of nodes, node list | Job name, number of used nodes, node list | Open source |
| PBS/Torque | Hostname, state, # of CPU cores, type, running jobs, load, physical memory, available memory, idle time, # of users, # of sessions | Job name, start time, end time, submit host, execute host | Commercial for PBS Pro; open source for OpenPBS and Torque |
| Grid Engine (Sun/Oracle Grid Engine) | Hostname, architecture, number of CPUs/sockets/cores/threads, load, total memory, used memory, total swap, used swap | Job name, CPU usage, I/O | Open source |
| HPC Pack (Microsoft) | Cores/memory/disk, processors, cores, sockets, cores in use, CPU usage, affinity, status, workload, network | Job name, job template, project, priority, requested nodes, cores per node, licenses, environment variables, job dependencies | Commercial |
Tab.2  Overview of resource monitoring and job logs in popular systems
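As a concrete illustration of how such job logs become model inputs, the sketch below parses records in the 18-field Standard Workload Format (SWF) used by the Parallel Workloads Archive [1] into per-job feature dictionaries. The field indices follow the SWF layout; the trace file name is only an example, and scheduler-specific logs (SLURM, PBS, LSF, etc.) would need their own column mapping.

```python
# Minimal sketch: turning SWF-style job-log records into feature rows for the
# prediction models surveyed below. Field positions follow the 18-column
# Standard Workload Format of the Parallel Workloads Archive [1]; adapt the
# indices if your scheduler exports a different layout.

def parse_swf(path):
    jobs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";"):   # skip header/comment lines
                continue
            fields = line.split()
            jobs.append({
                "job_id":        int(fields[0]),
                "submit_time":   int(fields[1]),   # seconds since trace start
                "wait_time":     int(fields[2]),   # seconds spent in the queue
                "run_time":      int(fields[3]),   # actual runtime in seconds
                "alloc_procs":   int(fields[4]),   # processors allocated
                "req_procs":     int(fields[7]),   # processors requested
                "req_time":      int(fields[8]),   # user's runtime estimate
                "user_id":       int(fields[11]),
                "executable_id": int(fields[13]),  # application/"template" key
            })
    return jobs

if __name__ == "__main__":
    # Example: any SWF trace downloaded from the Parallel Workloads Archive.
    jobs = parse_swf("CTC-SP2-1996-3.1-cln.swf")
    print(f"{len(jobs)} jobs loaded; first job: {jobs[0]}")
```

Per-job records of this kind are the common starting point for the runtime, queue-time, and power predictors discussed in the following tables.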
| Prediction methods | Refs. | Key techniques and inputs | Experimental data & platform | Prediction accuracy | Prediction stability | Difficulty & cost |
| Categorizing & statistics | [46] | Categorizing similar jobs into templates according to the executable name, degree of parallelism, and user name | Log files from 3 parallel computers | Coarse | Normal | Low |
| | [47] | A genetic algorithm evolving template attributes | 4 workloads recorded from parallel computers | 41%–71% | Normal | Normal |
| Application modeling | [48] | Stochastic values used to parameterize performance models | Different workloads on a contended network of workstations | 70%–95% | Normal | Normal |
| | [50] | Hidden Markov model based on historical running time | Traces from several parallel computers | 66.4%–99% | Normal | High |
| System modeling | [52] | Average runtime of the last two jobs by the same user, used as the runtime estimate of a new job | 4 traces from 4 parallel computers; an event-based scheduling simulation | 31%–62% | Normal | Low |
| | [63] | Based on the time stamps of simulation time, current time, and the specification of the job | Fire Dynamics Simulator; cloud | ~90% | High | Normal |
| | [64] | Analysis based on Amdahl's law | NPB; traditional HPC cluster and a private HPC cloud | Low | Normal | Normal |
| | [65] | Modeling the execution of an MPI job on a cluster using a queueing network | 5 SPEC MPI, 5 NPB; a cluster of bare-metal servers and virtual machines | 88% | High | High |
| ML: HBNN | [55] | Running information of history jobs, job input parameters | The Parallel Workloads Archive; simulation with pyss | High | Normal | Normal |
| LR | [54] | Input parameters and machine information (e.g., disk speed) | BLAST and RAxML; 4 multi-core clusters | 51.8%–89.2% | Normal | Normal |
| | [57] | Attributes of the current job and history jobs | 4 real job logs; simulations | 62.1%–82.3% | High | Low |
| Polynomial model | [56] | Runtime and requested resources of the user's history jobs | 6 real job logs; simulations | Normal | Normal | Normal |
| RF | [58] | Task submission information | Trace-driven simulation; clusters using Condor | Normal | High | High |
| EL | [59] | Ensemble learning with the LightGBM algorithm; RF, SVR, BRR, Bayesian model | 3 job logs from HPC systems; VASP jobs on an HPC cluster | Normal | Normal | Normal |
| Tobit model | [4] | Tobit-regression-based TRIP algorithm | Workload traces from two IBM Blue Gene supercomputers | 75%–80% | Normal | Normal |
| SVM | [60,61] | Categorization and instance learning | 2 real job logs (HPC2N04, ANL09) | 70%–80% | Normal | High |
| kNN | [12] | Job submission information and free processors (considers the uncertainty of the predictions) | Parallel job traces from real supercomputing centers; hybrid on-premise cluster and HPC cloud | High | Normal | High |
| RST | [66] | Rough set theory | Two computational jobs; multiple cloud platforms | High | Normal | High |
Tab.3  Overview of job runtime prediction methods for clusters, supercomputers and HPC clouds
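To make the machine-learning entries of Tab.3 concrete, the following sketch trains a random forest regressor on submission-time attributes (requested processors, requested wall time, user and executable identifiers) and predicts the runtime of unseen jobs, in the spirit of the RF and ensemble approaches [58,59]. The synthetic data and the feature set are illustrative assumptions rather than the setup of any cited work.

```python
# Minimal sketch of an ML-style runtime predictor: train a regressor on
# submission-time attributes of history jobs and predict the runtime of new
# jobs. The synthetic data below stand in for features parsed from real logs.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Submission-time features: requested cores, user's runtime estimate,
# user id, executable id (categorical ids used directly for simplicity).
X = np.column_stack([
    rng.integers(1, 256, n),          # requested processors
    rng.integers(600, 86400, n),      # requested wall time (s)
    rng.integers(0, 50, n),           # user id
    rng.integers(0, 20, n),           # executable id
])
# Synthetic "true" runtime: some fraction of the request plus noise.
y = X[:, 1] * rng.uniform(0.2, 0.9, n) + rng.normal(0, 300, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
mape = np.mean(np.abs(pred - y_te) / np.maximum(y_te, 1.0))
print(f"mean absolute percentage error: {mape:.2%}")
```

In practice the same pipeline is fed with history-job features extracted from real logs, and asymmetric penalties are often applied because underestimating runtime is more harmful to backfilling than overestimating it [4].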
| Prediction methods | Refs. | Key techniques / general comments | Experimental data & platform | Prediction accuracy | Prediction stability | Difficulty & cost |
| Runtime estimates | [67] | Predict wait time from runtime predictions based on history jobs | Workload traces from 3 supercomputer centers | Low | Normal | Normal |
| Statistical approach | [68] | Fit a statistical distribution to history jobs and use the distribution quantile of interest as the predictor for the next job | Batch jobs from 11 clusters over a 9-year period | Normal | High | Normal |
| Binomial method | [69] | Estimate an upper bound for the queuing delay with a quantified confidence level | 7 archival job logs covering a 9-year period from large HPC centers | Normal | Normal | High |
| | [70] | Predict quantiles directly based on the wait times of history jobs | Workflows on 5 supercomputers | Normal | Normal | High |
| | [68,71,72] | Estimate queue bounds from time series; predict quantiles based on nonparametric inference | History jobs from 11 clusters over a 9-year period | Normal | High | Normal |
| Qespera | [73,74] | Spatial clustering using information of history jobs | Feitelson's Parallel Workloads Archive from parallel computers | Most errors less than 1 hour | Normal | High |
Tab.4  Overview of job queue time prediction methods for clusters and supercomputers
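The statistical and binomial entries in Tab.4 share a simple core idea: instead of predicting a single wait time, bound a quantile of the wait-time distribution of similar history jobs with a stated confidence. The sketch below illustrates this order-statistic argument in the spirit of [68,69]; the exponential sample data are an assumption for demonstration only.

```python
# Minimal sketch of quantile-based wait-time prediction: from the wait times
# of similar history jobs, report an upper bound on the q-quantile that holds
# with a chosen confidence level.
import numpy as np
from scipy.stats import binom

def quantile_upper_bound(waits, q=0.95, confidence=0.95):
    """Return x_(k), the smallest order statistic that upper-bounds the
    q-quantile of the wait-time distribution with the given confidence."""
    x = np.sort(np.asarray(waits))
    n = len(x)
    for k in range(1, n + 1):
        # P(k-th order statistic >= true q-quantile) = P(Binom(n, q) <= k-1)
        if binom.cdf(k - 1, n, q) >= confidence:
            return x[k - 1]
    return x[-1]          # not enough samples for the requested confidence

# Example: wait times (seconds) of recent jobs in the same queue/size class.
history = np.random.default_rng(1).exponential(scale=1800, size=500)
bound = quantile_upper_bound(history, q=0.95, confidence=0.95)
print(f"with 95% confidence, 95% of similar jobs wait less than {bound:.0f} s")
```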
| Prediction methods | Refs. | Key techniques and inputs | Experimental workload and platform | Prediction accuracy | Prediction stability | Difficulty & cost |
| Power modeling | [40] | Instance-based regression model using submission data | SLURM submission logs of 12,476 jobs on the COBALT supercomputer | Normal | Normal | Normal |
| | [82] | Regression models based on CPU utilization, IPC, MPC, and performance counters | Four benchmarks from the SPEC2000 suite; virtualized multi-core server | >90% | Normal | High |
| | [83–86] | Multi-variable regression based on the power of CPU, memory, disk, etc. | Wave2D, Jacobi2D, HPL; multi-core cluster | High | High | Normal |
| | [92] | Regression with SVR | Trace data from a prototype HPC system; heterogeneous clusters (CPU+GPU+MIC) | >80% (good) | Normal | High |
| | [93] | Approximate the energy usage of arithmetic operations and memory operations | Two simulation grids related to weather simulations; a CPU+GPU node | >90% | High | High |
| Power profiling | [2] | Estimate job power profiles based on monitoring the power consumption of jobs | Trace-based simulations with one year of logs from an IBM Blue Gene/Q system | 94% | High | Normal |
| | [88] | Using power profiles | Job traces of different applications; clusters with hardware overprovisioning | 87% | High | High |
| | [89] | Constant power | Wave2D, Jacobi2D, LeanMD, Lulesh, AMR; a 38-node Dell cluster | Normal | High | Low |
| | [90] | Consistent value | Simulation and data from the Blue Gene/Q machine Mira | Normal | High | Low |
| | [91] | Using power profiles of characterized applications | 10 workloads of 1000 jobs in the field of linear algebra; hybrid CPU and GPU | Normal | Normal | Normal |
| RF | [108] | Performance events such as CPI; power of DRAM and processors | NPBs and two co-design mini-applications; a dual-processor node | Processor 97.7%; DRAM 92% | High | High |
Tab.5  Overview of power/energy consumption prediction methods for jobs in clusters and supercomputers
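Most regression-based rows of Tab.5 reduce to fitting node power as a function of utilization or counter-derived rates, as in [82, 83–86]. The sketch below fits such a linear model on synthetic measurements and predicts the power of a new job from its observed rates; the features, coefficients, and data are illustrative assumptions, not the models of the cited papers.

```python
# Minimal sketch of a regression-based power model: fit node power as a
# linear function of utilization/counter rates, then predict a job's power
# from the rates observed while it runs. All data here are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 1000
cpu_util = rng.uniform(0, 1, n)        # fraction of CPU busy
ipc      = rng.uniform(0.3, 2.5, n)    # instructions per cycle
mpc      = rng.uniform(0.0, 0.1, n)    # memory accesses per cycle
X = np.column_stack([cpu_util, ipc, mpc])

# Synthetic "measured" node power (W): idle power plus per-component terms.
power = 120 + 80 * cpu_util + 15 * ipc + 300 * mpc + rng.normal(0, 5, n)

model = LinearRegression().fit(X, power)
new_job = np.array([[0.9, 1.8, 0.05]])  # rates sampled while the job runs
print(f"predicted node power: {model.predict(new_job)[0]:.1f} W")
print("fitted coefficients:", np.round(model.coef_, 1),
      "idle ~", round(model.intercept_, 1))
```

Multiplying the predicted power by a predicted runtime is then one common way to obtain a per-job energy estimate.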
| Prediction methods | Refs. | Optimization goal | Key techniques and inputs | Experimental application & platform | Prediction accuracy | Prediction stability | Difficulty & cost |
| ANN | [100] | O#TP | Code, data, and runtime features extracted from profiling executions | 20 programs from UTDSP, NPB, MiBench; multi-core | Normal (<96–97%) | Normal | High |
| LR | [109] | O#TP | Collected data from hardware counters | 10 NPB benchmarks using OpenMP, MM5; multi-core | Median (87.4%) | High | Normal |
| PLR | [101] | O#TE, DVFS | Workload characteristics and resource information (avg. core temperature, frequency, stalls, etc.) | Parallel workloads from the PARSEC suite; quad-core processor | >87.4% | Normal | Normal |
| Lightweight DNN | [102] | O#TE | Performance events, execution time and average power | MiBench, IoMT, CoreMark workloads; ARM multicore | Up to 97% | High | High |
| BPI model | [103] | O#TP | Bytes-per-instruction model and least-squares approach | NPB; MIC (many-core) | Average 93.2% | High | High |
| Amdahl's law and regression analysis | [104] | O#TE | Regression analysis and the least-squares method on the basis of Amdahl's law | Ten programs from the PARSEC suite; MIC (many-core) | Normal | Normal | Normal |
| General model | [105] | O#TP, O#TE, DVFS, # of MICs | Modeling the runtime in offload mode and the power and energy consumption of all devices | CoMD proxy application; one and multiple nodes using MIC (many-core) | >90% | Normal | Normal |
| Amdahl's law | [11,41] | O#TP, O#TE, DVFS | Power-aware speedup model accounting for parallel overhead | NPB; a 16-node DVS-enabled cluster | Normal | Normal | Normal |
| Knowledge and experience | [31] | DVFS | Knowledge-based (decision tree) and experience-based prediction | Linpack, SFM, NPB; an HP cluster with 5 nodes | Normal | Normal | Normal |
| RF | [106] | DVFS | Global Extensible Open Power Manager (GEOPM) and the DCDB monitoring database | CORAL-2 suite; CooLMUC-3 cluster system | 94% | High | High |
| Least-squares regression | [111] | O#TP, O#TE, DVFS | Number of compute nodes, number of cores and threads, CPU frequency and DVFS settings | Stencil, Transpose, AMG and LAMMPS; a Cray system with 86 44-core compute nodes | >90% | Normal | Normal |
| Amdahl's and Gustafson's laws | [64] | O#TP, O#TE | Asymptotic complexity models, timing models, and separating communication from computation | NPB; a traditional HPC cluster and a private HPC cloud | Low | Normal | Normal |
| Exponential smoothing | [114] | O#VMC | Predict the future weekly workload by extending the exponential smoothing method for time-series data | Parallel Workloads Archive and social-network population traces; SLURM, Amazon EC2 and Google Compute Engine | Good | Normal | High |
| RF; LR, ANN | [6] | O#TP, O#TC, OMP/OMC | About 200 (out of 935) features of operation mix, instruction-level parallelism, reuse distance, library calls, communication requirements | NAS Parallel Benchmarks (NPB); OpenStack | >88% (CP); >70% (PP) | High | High |
Tab.6  Overview of optimal resource configuration prediction methods in HPC systems
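Several rows of Tab.6 combine an analytical speedup law with least-squares fitting to choose a resource configuration, e.g., [64,104,111]. The sketch below fits the serial fraction of Amdahl's law from a few profiling runs and then recommends the smallest core count whose predicted runtime meets a target; the timings and the target are illustrative assumptions, not results from the cited works.

```python
# Minimal sketch of Amdahl's-law-based configuration prediction: fit the
# serial fraction from a few profiling runs, predict runtime at candidate
# core counts, and pick the cheapest configuration meeting a runtime target.
import numpy as np
from scipy.optimize import curve_fit

def amdahl_time(p, t1, serial_frac):
    """Predicted runtime on p cores under Amdahl's law."""
    return t1 * (serial_frac + (1.0 - serial_frac) / p)

# Profiling runs at a handful of core counts (seconds).
cores    = np.array([1, 2, 4, 8])
runtimes = np.array([1000.0, 540.0, 310.0, 200.0])

(t1, s), _ = curve_fit(amdahl_time, cores, runtimes,
                       p0=[runtimes[0], 0.1], bounds=([0, 0], [np.inf, 1]))

target = 150.0                                     # required runtime (s)
candidates = [16, 32, 64, 128]
predicted = {p: amdahl_time(p, t1, s) for p in candidates}
feasible = [p for p in candidates if predicted[p] <= target]
choice = min(feasible) if feasible else max(candidates)

print(f"fitted serial fraction: {s:.3f}")
print("predicted runtimes:", {p: round(t, 1) for p, t in predicted.items()})
print(f"recommended core count: {choice}")
```

The same fit-then-search pattern carries over when DVFS levels or accelerator counts are added to the candidate space, at the cost of profiling more configurations.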
Fig.3  Future research directions for intelligent resource allocation in HPC systems
| Concerns | Current | Future |
| Job logs | Performance and resource | + energy and cost |
| Features | Application and resource characteristics of history jobs | + online hardware counters |
| # of jobs | One job | + multiple jobs |
| Objective | Runtime | + energy and cost |
| Accuracy | Accurate | More accurate |
| Overhead | It depends | Less time overhead |
Tab.7  Some main concerns between current and future prediction of job characteristics in HPC systems
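Moving from single-objective to multi-objective prediction, as Tab.7 anticipates, ultimately means ranking candidate configurations by several predicted quantities at once. A minimal sketch, assuming runtime, energy, and cost predictions are already available for each candidate, is a weighted sum of normalized objectives; the candidate values and weights below are purely illustrative.

```python
# Minimal sketch of multi-objective configuration selection: rank candidate
# configurations by a weighted, normalized combination of predicted runtime,
# energy and monetary cost. All numbers here are illustrative assumptions.
candidates = {
    # config name: (predicted runtime [s], predicted energy [kJ], cost [$])
    "2 nodes, 2.4 GHz": (620.0, 410.0, 0.55),
    "4 nodes, 2.4 GHz": (340.0, 450.0, 0.61),
    "4 nodes, 2.0 GHz": (395.0, 380.0, 0.52),
    "8 nodes, 2.4 GHz": (210.0, 560.0, 0.75),
}
weights = {"runtime": 0.5, "energy": 0.3, "cost": 0.2}

def normalized(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

names = list(candidates)
cols = list(zip(*candidates.values()))        # runtime, energy, cost columns
norm = [normalized(c) for c in cols]
scores = {
    name: weights["runtime"] * norm[0][i]
        + weights["energy"]  * norm[1][i]
        + weights["cost"]    * norm[2][i]
    for i, name in enumerate(names)
}
best = min(scores, key=scores.get)
print("scores:", {k: round(v, 3) for k, v in scores.items()})
print("selected configuration:", best)
```

Weighted scoring is only the simplest option; the reference list also includes bi-objective optimization and multi-objective machine-learning approaches such as [112,113].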
1 D G Feitelson , D Tsafrir , D Krakov . Experience with using the parallel workloads archive. Journal of Parallel and Distributed Computing, 2014, 74( 10): 2967– 2982
2 S Wallace X Yang V Vishwanath W E Allcock S Coghlan M E Papka Z Lan. A data driven scheduling approach for power management on HPC systems. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2016, 56
3 Y Tsujita A Uno R Sekizaw K Yamamoto F Sueyasu. Job classification through long-term log analysis towards power-aware HPC system operation. In: Proceedings of the 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP). 2021, 26– 34
4 Y Fan P Rich W E Allcock M E Papka Z Lan. Trade-off between prediction accuracy and underestimation rate in job runtime estimates. In: Proceedings of the 2017 IEEE International Conference on Cluster Computing (CLUSTER). 2017, 530– 540
5 M A S Netto , R N Calheiros , E R Rodrigues , R L F Cunha , R Buyya . HPC cloud for scientific and business applications: taxonomy, vision, and research challenges. ACM Computing Surveys, 2019, 51( 1): 8
6 G Mariani , A Anghel , R Jongerius , G Dittmann . Predicting cloud performance for HPC applications before deployment. Future Generation Computer Systems, 2018, 87: 618– 628
7 A C Orgerie , M D De Assuncao , L Lefevre . A survey on techniques for improving the energy efficiency of large-scale distributed systems. ACM Computing Surveys, 2014, 46( 4): 47
8 A H Kelechi , M H Alsharif , O J Bameyi , P J Ezra , I K Joseph , A A Atayero , Z W Geem , J Hong . Artificial intelligence: an energy efficiency tool for enhanced high performance computing. Symmetry, 2020, 12( 6): 1029
9 E D Wang. High Productivity Computing System: Design and Applications. China Science Publishing & Media Ltd, 2014
10 S Prabhakaran. Dynamic resource management and job scheduling for high performance computing. Technische Universität Darmstadt, Dissertation, 2016
11 R Ge K W Cameron. Power-aware speedup. In: Proceedings of the 2007 IEEE International Parallel and Distributed Processing Symposium. 2007, 1– 10
12 R L F Cunha , E R Rodrigues , L P Tizzei , M A S Netto . Job placement advisor based on turnaround predictions for HPC hybrid clouds. Future Generation Computer Systems, 2017, 67: 35– 46
13 A F Leite , A Boukerche , A C M A De Melo , C Eisenbeis , C Tadonki , C G Ralha . Power-aware server consolidation for federated clouds. Concurrency and Computation: Practice and Experience, 2016, 28( 12): 3427– 3444
14 L Yu , Z Zhou , Y Fan , M E Papka , Z Lan . System-wide trade-off modeling of performance, power, and resilience on petascale systems. The Journal of Supercomputing, 2018, 74( 7): 3168– 3192
15 S Blagodurov A Fedorova E Vinnik T Dwyer F Hermenier. Multi-objective job placement in clusters. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2015, 66
16 A N Toosi , R N Calheiros , R Buyya . Interconnected cloud computing environments: challenges, taxonomy, and survey. ACM Computing Surveys, 2014, 47( 1): 7
17 Z Hou , Y Wang , Y Sui , J Gu , T Zhao , X Zhou . Managing high-performance computing applications as an on-demand service on federated clouds. Computers & Electrical Engineering, 2018, 67: 579– 595
18 H Hussain , S U R Malik , A Hameed , S U Khan , G Bickler , N Min-Allah , M B Qureshi , L Zhang , Y Wang , N Ghani , J Kolodziej , A Y Zomaya , C Z Xu , P Balaji , A Vishnu , F Pinel , J E Pecero , D Kliazovich , P Bouvry , H Li , L Wang , D Chen , A Rayes . A survey on resource allocation in high performance distributed computing systems. Parallel Computing, 2013, 39( 11): 709– 736
19 M L Massie , B N Chun , D E Culler . The ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, 2004, 30( 7): 817– 840
20 W Allcock P Rich Y Fan Z Lan. Experience and practice of batch scheduling on leadership supercomputers at Argonne. In: Proceedings of 21st Job Scheduling Strategies for Parallel Processing. 2017, 1− 24
21 J Yoon , T Hong , C Park , S Y Noh , H Yu . Log analysis-based resource and execution time improvement in HPC: a case study. Applied Sciences, 2020, 10( 7): 2634
22 S Islam , J Keung , K Lee , A Liu . Empirical prediction models for adaptive resource provisioning in the cloud. Future Generation Computer Systems, 2012, 28( 1): 155– 162
23 E Cortez A Bonde A Muzio M Russinovich M Fontoura R Bianchini. Resource central: understanding and predicting workloads for improved resource management in large cloud platforms. In: Proceedings of the 26th Symposium on Operating Systems Principles. 2017, 153− 167
24 A Marowka. On performance analysis of a multithreaded application parallelized by different programming models using Intel VTune. In: Proceedings of the 11th International Conference on Parallel Computing Technologies. 2011, 317− 331
25 D Terpstra H Jagode H You J Dongarra. Collecting performance data with PAPI-C. In: Proceedings of the 3rd International Workshop on Parallel Tools for High Performance Computing. 2009, 157− 173
26 M Dimakopoulou S Eranian N Koziris N Bambos. Reliable and efficient performance monitoring in Linux. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2016, 396− 408
27 V M Weaver. Self-monitoring Overhead of the Linux perf_event performance counter interface. In: Proceedings of the 2015 IEEE International Symposium on Performance Analysis of Systems and Software. 2015, 102− 111
28 J Treibig G Hager G Wellein. LIKWID: a lightweight performance-oriented tool suite for x86 multicore environments. In: Proceedings of the 39th International Conference on Parallel Processing Workshops. 2010, 207− 216
29 C Pospiech. Hardware performance monitor (HPM) toolkit users guide. Advanced Computing Technology Center, IBM Research. See researcher.watson.ibm.com/researcher/files/us-hfwen/HPM_ug.pdf website, 2008
30 Y Georgiou D Glesser K Rzadca D Trystram. A scheduler-level incentive mechanism for energy efficiency in HPC. In: Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 2015, 617− 626
31 H V Raghu S K Saurav B S Bapu. PAAS: power aware algorithm for scheduling in high performance computing. In: Proceedings of the 6th IEEE/ACM International Conference on Utility and Cloud Computing. 2013, 327− 332
32 S Wallace V Vishwanath S Coghlan J Tramm Z Lan M E Papka. Application power profiling on IBM Blue Gene/Q. In: Proceedings of the 2013 IEEE International Conference on Cluster Computing (CLUSTER). 2013, 1− 8
33 S Browne , J Dongarra , N Garner , G Ho , P Mucci . A portable programming interface for performance evaluation on modern processors. The International Journal of High Performance Computing Applications, 2000, 14( 3): 189– 204
34 M Rashti G Sabin D Vansickle B Norris. WattProf: a flexible platform for fine-grained HPC power profiling. In: Proceedings of the 2015 IEEE International Conference on Cluster Computing. 2015, 698− 705
35 J H Laros D DeBonis R E Grant S M Kelly M Levenhagen S Olivier K Pedretti. High performance computing-power application programming interface specification, version 1.2. See cfwebprod.sandia.gov/cfdocs/CompResearch/docs/PowerAPI_SAND_V1.1a(3).pdf website, 2016
36 R Kavanagh , K Djemame . Rapid and accurate energy models through calibration with IPMI and RAPL. Concurrency and Computation: Practice and Experience, 2019, 31( 13): e5124
37 V M Weaver M Johnson K Kasichayanula J Ralph P Luszczek D Terpstra S Moore. Measuring energy and power with PAPI. In: Proceedings of the 41st International Conference on Parallel Processing Workshops. 2012, 262− 268
38 E Rotem , A Naveh , A Ananthakrishnan , E Weissmann , D Rajwan . Power-management architecture of the Intel microarchitecture code-named Sandy Bridge. IEEE Micro, 2012, 32( 2): 20– 27
39 J Leng T Hetherington A ElTantawy S Gilani N S Kim T M Aamodt V J Reddi. GPUwattch: enabling energy optimizations in GPGPUs. In: Proceedings of the 40th Annual International Symposium on Computer Architecture. 2013, 487− 498
40 T Saillant J C Weill M Mougeot. Predicting job power consumption based on RJMS submission data in HPC systems. In: Proceedings of the 35th International Conference on High Performance Computing. 2020, 63− 82
41 C Jin , B R De Supinski , D Abramson , H Poxon , L DeRose , M N Dinh , M Endrei , E R Jessup . A survey on software methods to improve the energy efficiency of parallel computing. The International Journal of High Performance Computing Applications, 2017, 31( 6): 517– 549
42 Y Georgiou T Cadeau D Glesser D Auble M Jette M Hautreux. Energy accounting and control with SLURM resource and job management system. In: Proceedings of the 15th International Conference on Distributed Computing and Networking. 2014, 96− 118
43 S J Martin D Rush M Kappel. Cray advanced platform monitoring and control. In: Proceedings of the Cray User Group Meeting, Chicago, IL. 2015, 26− 30
44 D Thain , T Tannenbaum , M Livny . Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 2005, 17( 2-4): 323– 356
45 A B Yoo M A Jette M Grondona. SLURM: simple Linux utility for resource management. In: Proceedings of the 9th Workshop on Job Scheduling Strategies for Parallel Processing. 2003, 44− 60
46 R Gibbons. A historical application profiler for use by parallel schedulers. In: Proceedings of Workshop on Job Scheduling Strategies for Parallel Processing. 1997, 58− 77
47 W Smith , I Foster , V Taylor . Predicting application run times with historical information. Journal of Parallel and Distributed Computing, 2004, 64( 9): 1007– 1016
48 J M Schopf F Berman. Using stochastic intervals to predict application behavior on contended resources. In: Proceedings of the Fourth International Symposium on Parallel Architectures, Algorithms, and Networks. 1999, 344− 349
49 C L Mendes D A Reed. Integrated compilation and scalability analysis for parallel systems. In: Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques. 1998, 385− 392
50 A Nissimov. Locality and its usage in parallel job runtime distribution modeling using HMM. Hebrew University, Dissertation, 2006
51 L R Rabiner . A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 1989, 77( 2): 257– 286
52 D Tsafrir , Y Etsion , D G Feitelson . Backfilling using system-generated predictions rather than user runtime estimates. IEEE Transactions on Parallel and Distributed Systems, 2007, 18( 6): 789– 803
53 Z Hou S Zhao C Yin Y Wang J Gu X Zhou. Machine learning based performance analysis and prediction of jobs on a HPC cluster. In: Proceedings of the 20th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT). 2019, 247− 252
54 A Matsunaga J A B Fortes. On the use of machine learning to predict the time and resources consumed by applications. In: Proceedings of the 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing. 2010, 495− 504
55 R Duan F Nadeem J Wang Y Zhang R Prodan T Fahringer. A hybrid intelligent method for performance modeling and prediction of workflow activities in grids. In: Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. 2009, 339− 347
56 E Gaussier D Glesser V Reis D Trystram. Improving backfilling by using machine learning to predict running times. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2015, 1− 10
57 J Li , X Zhang , L Han , Z Ji , X Dong , C Hu . OKCM: improving parallel task scheduling in high-performance computing systems using online learning. The Journal of Supercomputing, 2021, 77( 6): 5960– 5983
58 A S McGough N A Moubayed M Forshaw. Using machine learning in trace-driven energy-aware simulations of high-throughput computing systems. In: Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion. 2017, 55− 60
59 X Chen H Zhang H Bai C Yang X Zhao B Li. Runtime prediction of high-performance computing jobs based on ensemble learning. In: Proceedings of the 4th International Conference on High Performance Compilation, Computing and Communications. 2020, 56− 62
60 G B Wu Y Shen W S Zhang S S Liao Q Q Wang J Li. Runtime prediction of jobs for backfilling optimization. Journal of Chinese Computer Systems (in Chinese), 2019, 40(1): 6− 12
61 Y H Xiao L F Xu M Xiong. GA-Sim: a job running time prediction algorithm based on categorization and instance learning. Computer Engineering & Science (in Chinese), 2019, 41(6): 987− 992
62 M Parashar , M AbdelBaky , I Rodero , A Devarakonda . Cloud paradigms and practices for computational and data-enabled science and engineering. Computing in Science & Engineering, 2013, 15( 4): 10– 18
63 X Li H Palit Y S Foo T Hung. Building an HPC-as-a-service toolkit for user-interactive HPC services in the cloud. In: Proceedings of the 2011 IEEE Workshops of International Conference on Advanced Information Networking and Applications. 2011, 369− 374
64 J Y Shi M Taifi A Pradeep A Khreishah V Antony. Program scalability analysis for HPC cloud: applying Amdahl’s law to NAS benchmarks. In: Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. 2012, 1215− 1225
65 A Saad , A El-Mahdy . HPCCloud seer: a performance model based predictor for parallel applications on the cloud. IEEE Access, 2020, 8: 87978– 87993
66 C T Fan Y S Chang W J Wang S M Yuan. Execution time prediction using rough set theory in hybrid cloud. In: Proceedings of the 9th International Conference on Ubiquitous Intelligence and Computing and 9th International Conference on Autonomic and Trusted Computing. 2012, 729− 734
67 W Smith V E Taylor I T Foster. Using run-time predictions to estimate queue wait times and improve scheduler performance. In: Proceedings of the Job Scheduling Strategies for Parallel Processing. 1999, 202− 219
68 D Nurmi J Brevik R Wolski. QBETS: queue bounds estimation from time series. In: Proceedings of the 13th Workshop on Job Scheduling Strategies for Parallel Processing. 2007, 76− 101
69 J Brevik D Nurmi R Wolski. Predicting bounds on queuing delay for batch-scheduled parallel machines. In: Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2006, 110− 118
70 D Nurmi A Mandal J Brevik C Koelbel R Wolski K Kennedy. Evaluation of a workflow scheduler using integrated performance modelling and batch queue wait time prediction. In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. 2006, 29
71 M A S Netto , R L F Cunha , N Sultanum . Deciding when and how to move HPC jobs to the cloud. Computer, 2015, 48( 11): 86– 89
72 W Smith. A service for queue prediction and job statistics. In : Proceedings of the 2010 Gateway Computing Environments Workshop (GCE). 2010, 1− 8
73 P Murali , S Vadhiyar . Qespera: an adaptive framework for prediction of queue waiting times in supercomputer systems. Concurrency and Computation: Practice and Experience, 2016, 28( 9): 2685– 2710
74 P Murali , S Vadhiyar . Metascheduling of HPC jobs in day-ahead electricity markets. IEEE Transactions on Parallel and Distributed Systems, 2018, 29( 3): 614– 627
75 E N Elnozahy M Kistler R Rajamony. Energy-efficient server clusters. In: Proceedings of the 2nd International Workshop on Power-aware Computer Systems. 2002, 179− 197
76 B Lawson E Smirni. Power-aware resource allocation in high-end systems via online simulation. In: Proceedings of the 19th Annual International Conference on Supercomputing. 2005, 229− 238
77 M Etinski J Corbalan J Labarta M Valero. Optimizing job performance under a given power constraint in HPC centers. In: Proceedings of the International Conference on Green Computing. 2010, 257− 267
78 M Etinski , J Corbalan , J Labarta , M Valero . Parallel job scheduling for power constrained HPC systems. Parallel Computing, 2012, 38( 12): 615– 630
79 O Mämmelä , M Majanen , R Basmadjian , Meer H De , A Giesler , W Homberg . Energy-aware job scheduler for high-performance computing. Computer Science - Research and Development, 2012, 27( 4): 265– 275
80 Z Zhou Z Lan W Tang N Desai. Reducing energy costs for IBM Blue Gene/P via power-aware job scheduling. In: Proceedings of the 17th Workshop on Job Scheduling Strategies for Parallel Processing. 2014, 96− 115
81 A Marathe P E Bailey D K Lowenthal B Rountree M Schulz B R De Supinski. A run-time system for power-constrained HPC applications. In: Proceedings of the 30th International Conference on High Performance Computing. 2015, 394− 408
82 G Dhiman K Mihic T Rosing. A system for online power prediction in virtualized environments using gaussian mixture models. In: Proceedings of the 47th Design Automation Conference. 2010, 807− 812
83 R Basmadjian H De Meer. Evaluating and modeling power consumption of multi-core processors. In: Proceedings of the 3rd International Conference on Future Systems: Where Energy, Computing and Communication Meet (e-Energy). 2012, 1− 10
84 R Basmadjian G D Costa G L T Chetsa L Lefevre A Oleksiak J M Pierson. Energy-aware approaches for HPC systems. In: Jeannot E, Žilinskas J, eds. High-Performance Computing on Complex Environments. Hoboken: John Wiley & Sons, Inc, 2014
85 B Subramaniam W C Feng. Statistical power and performance modeling for optimizing the energy efficiency of scientific computing. In: Proceedings of the 2010 IEEE/ACM Int’l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing. 2010, 139− 146
86 L K John L Eeckhout. Performance Evaluation and Benchmarking. New York: CRC Press, 2005
87 T Patki D K Lowenthal B Rountree M Schulz B R De Supinski. Exploring hardware overprovisioning in power-constrained, high performance computing. In: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. 2013, 173− 182
88 T Patki D K Lowenthal A Sasidharan M Maiterth B L Rountree M Schulz B R De Supinski. Practical resource management in power-constrained, high performance computing. In: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing. 2015, 121− 132
89 O Sarood A Langer A Gupta L Kale. Maximizing throughput of overprovisioned HPC data centers under a strict power budget. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2014, 807− 818
90 D A Ellsworth A D Malony B Rountree M Schulz. Dynamic power sharing for higher job throughput. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2015, 80
91 M Chiesi , L Vanzolini , C Mucci , E F Scarselli , R Guerrieri . Power-aware job scheduling on heterogeneous multicore architectures. IEEE Transactions on Parallel and Distributed Systems, 2015, 26( 3): 868– 877
92 A Sîrbu O Babaoglu. Power consumption modeling and prediction in a hybrid CPU-GPU-MIC supercomputer. In: Proceedings of the 22nd European Conference on Parallel Processing. 2016, 117− 130
93 M Ciznicki , K Kurowski , J Weglarz . Energy aware scheduling model and online heuristics for stencil codes on heterogeneous computing architectures. Cluster Computing, 2017, 20( 3): 2535– 2549
94 M Dayarathna , Y Wen , R Fan . Data center energy consumption modeling: a survey. IEEE Communications Surveys & Tutorials, 2016, 18( 1): 732– 794
95 E K Lee H Viswanathan D Pompili. VMAP: proactive thermal-aware virtual machine allocation in HPC cloud datacenters. In: Proceedings of the 19th International Conference on High Performance Computing. 2012, 1− 10
96 R Aversa B Di Martino M Rak S Venticinque U Villano. Performance prediction for HPC on clouds. In: Buyya R, Broberg J, Goscinski A, eds. Cloud Computing: Principles and Paradigms. Hoboken: John Wiley & Sons, Inc, 2011
97 M Liu Y Jin J Zhai Y Zha Q Shi X Ma W Chen. ACIC: automatic cloud I/O configurator for HPC applications. In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 2013, 1− 12
98 M Rak , M Turtur , U Villano . Early prediction of the cost of cloud usage for HPC applications. Scalable Computing: Practice and Experience, 2015, 16( 3): 303– 320
99 A Geist , D A Reed . A survey of high-performance computing scaling challenges. The International Journal of High Performance Computing Applications, 2017, 31( 1): 104– 113
100 Z Wang M F P O’Boyle. Mapping parallelism to multi-cores: a machine learning based approach. In: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 2009, 75− 84
101 R Cochran C Hankendi A Coskun S Reda. Identifying the optimal energy-efficient operating points of parallel workloads. In: Proceedings of the 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 2011, 608− 615
102 B Gomatheeshwari , J Selvakumar . Appropriate allocation of workloads on performance asymmetric multicore architectures via deep learning algorithms. Microprocessors and Microsystems, 2020, 73: 102996
103 X Bai , E Wang , X Dong , X Zhang . A scalability prediction approach for multi-threaded applications on manycore processors. The Journal of Supercomputing, 2015, 71( 11): 4072– 4094
104 T Ju W Wu H Chen Z Zhu X Dong. Thread count prediction model: dynamically adjusting threads for heterogeneous many-core systems. In: Proceedings of the 21st IEEE International Conference on Parallel and Distributed Systems. 2015, 456− 464
105 G Lawson V Sundriyal M Sosonkina Y Shen. Modeling performance and energy for applications offloaded to Intel Xeon Phi. In: Proceedings of the 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing. 2015, 7
106 G Ozer S Garg N Davoudi G Poerwawinata M Maiterth A Netti D Tafani. Towards a predictive energy model for HPC runtime systems using supervised learning. In: Proceedings of the European Conference on Parallel Processing. 2019, 626− 638
107 S Niu , J Zhai , X Ma , X Tang , W Chen , W Zheng . Building semi-elastic virtual clusters for cost-effective HPC cloud resource provisioning. IEEE Transactions on Parallel and Distributed Systems, 2016, 27( 7): 1915– 1928
108 P Balaprakash A Tiwari S M Wild L Carrington P D Hovland. AutoMOMML: automatic multi-objective modeling with machine learning. In: Proceedings of the 31st International Conference on High Performance Computing. 2016, 219− 239
109 M Curtis-Maury , F Blagojevic , C D Antonopoulos , D S Nikolopoulos . Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Transactions on Parallel and Distributed Systems, 2008, 19( 10): 1396– 1410
110 D De Sensi. Predicting performance and power consumption of parallel applications. In: Proceedings of the 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP). 2016, 200− 207
111 M Endrei C Jin M N Dinh D Abramson H Poxon L DeRose B R De Supinski. Energy efficiency modeling of parallel applications. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2018, 212− 224
112 R R Manumachu , A Lastovetsky . Bi-objective optimization of data-parallel applications on homogeneous multicore clusters for performance and energy. IEEE Transactions on Computers, 2018, 67( 2): 160– 177
113 M Hao , W Zhang , Y Wang , G Lu , F Wang , A V Vasilakos . Fine-grained powercap allocation for power-constrained systems based on multi-objective machine learning. IEEE Transactions on Parallel and Distributed Systems, 2021, 32( 7): 1789– 1801
114 T Scogland J Azose D Rohr S Rivoire N Bates D Hackenberg. Node Variability in Large-Scale Power Measurements: perspectives from the Green500, Top500 and EEHPCWG. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2015, 1− 11
115 I Foster Y Zhao I Raicu S Lu. Cloud computing and grid computing 360-degree compared. In: Proceedings of the 2008 Grid Computing Environments Workshop. 2008, 1− 10
116 S Seneviratne S Witharana. A survey on methodologies for runtime prediction on grid environments. In: Proceedings of the 7th International Conference on Information and Automation for Sustainability. 2014, 1− 6
117 Q Yang , Y Liu , T Chen , Y Tong . Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology, 2019, 10( 2): 12
118 T Ben-Nun , T Hoefler . Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Computing Surveys, 2020, 52( 4): 65
119 C Li , H Sun , H Tang , Y Luo . Adaptive resource allocation based on the billing granularity in edge-cloud architecture. Computer Communications, 2019, 145: 29– 42
120 A I Orhean , F Pop , I Raicu . New scheduling approach using reinforcement learning for heterogeneous distributed systems. Journal of Parallel and Distributed Computing, 2018, 117: 292– 302
121 C L P Chen , Z Liu . Broad learning system: an effective and efficient incremental learning system without the need for deep architecture. IEEE Transactions on Neural Networks and Learning Systems, 2018, 29( 1): 10– 24
122 M Naghshnejad , M Singhal . A hybrid scheduling platform: a runtime prediction reliability aware scheduling platform to improve HPC scheduling performance. The Journal of Supercomputing, 2020, 76( 1): 122– 149
123 D Ye , D Z Chen , G Zhang . Online scheduling of moldable parallel tasks. Journal of Scheduling, 2018, 21( 6): 647– 654
124 J J Dongarra H D Simon. High performance computing in the US in 1995 - An analysis on the basis of the TOP500 list. Supercomputer, 1997, 13(1): 19− 28
125 W C Feng , K W Cameron . The Green500 list: encouraging sustainable supercomputing. Computer, 2007, 40( 12): 50– 55
126 S Wienke H Iliev D A Mey M S Muller. Modeling the productivity of HPC systems on a computing center scale. In: Proceedings of the 30th International Conference on High Performance Computing. 2015, 358− 375
127 J Dongarra , R Graybill , W Harrod , R Lucas , E Lusk , P Luszczek , J Mcmahon , A Snavely , J Vetter , K Yelick , S Alam , R Campbell , L Carrington , T Y Chen , O Khalili , J Meredith , M Tikir . DARPA’s HPCS program: history, models, tools, languages. Advances in Computers, 2008, 72: 1– 100