Frontiers of Computer Science

Front. Comput. Sci.    2018, Vol. 12 Issue (3) : 545-559    https://doi.org/10.1007/s11704-016-6206-y
RESEARCH ARTICLE
Resolving the GPU responsiveness dilemma through program transformations
Qi ZHU1,2,3, Bo WU4, Xipeng SHEN3, Kai SHEN5, Li SHEN1, Zhiying WANG1
1. National Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410073, China
2. Jiangnan Institute of Computing Technology, Wuxi 214083, China
3. Department of Computer Science, North Carolina State University, Raleigh NC 27695, USA
4. EECS, Colorado School of Mines, Golden CO 80401, USA
5. Department of Computer Science, University of Rochester, Rochester NY 14627, USA
Abstract

The emerging integrated CPU–GPU architectures make it practical for even short computational kernels to benefit from GPU acceleration. Evidence has shown that, on such systems, GPU control responsiveness (how soon the host program learns that a GPU kernel has completed) is essential for overall performance. This study identifies the GPU responsiveness dilemma: host busy polling responds quickly, but at the expense of high energy consumption and interference with co-running CPU programs; interrupt-based notification minimizes energy and CPU interference costs, but suffers from substantial response delay. We present a program-level solution that wakes up the host program in anticipation of GPU kernel completion. We systematically explore the design space of an anticipatory wakeup scheme realized through either a timer-delayed wakeup or a kernel-splitting-based pre-completion notification. Experiments show that the proposed technique achieves the best of both worlds, namely high responsiveness with low power and CPU costs, for a wide range of GPU workloads.
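To make the dilemma concrete, the sketch below (in CUDA, not taken from the paper) contrasts the three host-side completion-detection strategies the abstract refers to: busy polling on an event, an interrupt-style blocking wait, and a timer-delayed wakeup that sleeps through most of the kernel's anticipated runtime before polling. The kernel, its 500 µs runtime estimate, and the 90% sleep fraction are illustrative assumptions only.

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>

// Placeholder short kernel standing in for the paper's GPU workloads.
__global__ void shortKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t done;
    // cudaEventBlockingSync makes cudaEventSynchronize() yield the CPU
    // (interrupt-style wait) instead of spinning on the event.
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    // Strategy 1: busy polling -- fast response, high energy/CPU cost.
    shortKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(done);
    while (cudaEventQuery(done) == cudaErrorNotReady) { /* spin */ }

    // Strategy 2: interrupt-style blocking wait -- cheap, but the host
    // learns of completion only after a notification delay.
    shortKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(done);
    cudaEventSynchronize(done);

    // Strategy 3: timer-delayed wakeup -- sleep through most of the
    // kernel's anticipated runtime, then poll only near completion.
    // expected_us is a hypothetical estimate (e.g., from a prior run).
    const auto expected_us = std::chrono::microseconds(500);
    shortKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(done);
    std::this_thread::sleep_for(expected_us * 9 / 10);
    while (cudaEventQuery(done) == cudaErrorNotReady) { /* brief spin */ }

    printf("all three completion-detection variants finished\n");
    cudaEventDestroy(done);
    cudaFree(d_data);
    return 0;
}

The paper's second variant, kernel splitting, instead launches the kernel in pieces so that the completion of an early piece serves as a pre-completion signal; the timer-based sketch above only illustrates the anticipatory-wakeup idea.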

Keywords: program transformation; GPU; integrated architecture; responsiveness
Corresponding Author(s): Qi ZHU   
Just Accepted Date: 08 November 2016   Online First Date: 06 March 2018    Issue Date: 02 May 2018
 Cite this article:   
Qi ZHU, Bo WU, Xipeng SHEN, et al. Resolving the GPU responsiveness dilemma through program transformations[J]. Front. Comput. Sci., 2018, 12(3): 545-559.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-016-6206-y
https://academic.hep.com.cn/fcs/EN/Y2018/V12/I3/545