Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2017, Vol. 11 Issue (1) : 130-146    https://doi.org/10.1007/s11704-016-5468-8
RESEARCH ARTICLE
Understanding co-run performance on CPU-GPU integrated processors: observations, insights, directions
Qi ZHU 1,2, Bo WU 3, Xipeng SHEN 2, Kai SHEN 4, Li SHEN 1, Zhiying WANG 1
1. National Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha 410073, China
2. Department of Computer Science, North Carolina State University, Raleigh, NC 27695, USA
3. Department of Electrical Engineering & Computer Science, Colorado School of Mines, Golden, CO 80401, USA
4. Department of Computer Science, University of Rochester, Rochester, NY 14627, USA
Abstract

Recent years have witnessed a processor development trend that integrates the central processing unit (CPU) and the graphics processing unit (GPU) into a single chip. The integration helps to save some of the host-device data copying that a discrete GPU usually requires, but it also introduces deep resource sharing and possible interference between the CPU and the GPU. This work investigates the performance implications of independently co-running CPU and GPU programs on these platforms. First, we perform a comprehensive measurement that covers a wide variety of factors, including processor architectures, operating systems, benchmarks, timing mechanisms, inputs, and power management schemes. These measurements reveal a number of surprising observations. We analyze these observations and produce a list of novel insights, including the important roles of operating system (OS) context switching and power management in determining program performance, and the subtle effect of CPU-GPU data copying. Finally, we confirm those insights through case studies, and point out some promising directions to mitigate anomalous performance degradation on integrated heterogeneous processors.
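One of the measurement factors the abstract lists is the timing mechanism itself. As a minimal illustration (not from the paper), the Python sketch below contrasts wall-clock time with process CPU time; the two diverge whenever the OS deschedules the process, which is exactly the kind of context-switching effect the study identifies, so the choice of clock can change what a co-run experiment appears to show:

```python
import time

def busy(n):
    # Spin to consume CPU time on this core.
    x = 0
    for i in range(n):
        x += i * i
    return x

wall0 = time.monotonic()      # wall clock: advances even while descheduled
cpu0 = time.process_time()    # CPU clock: advances only while this process runs
busy(1_000_000)
time.sleep(0.2)               # descheduled: wall clock keeps ticking, CPU clock does not
wall = time.monotonic() - wall0
cpu = time.process_time() - cpu0
print(f"wall: {wall:.3f}s, cpu: {cpu:.3f}s")
```

A workload that looks unaffected under a CPU-time clock can still show large wall-clock slowdowns when a co-runner causes extra context switches or power-state changes.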

Keywords: performance analysis; GPGPU; co-run degradation; fused processor; program transformation
Corresponding Author(s): Qi ZHU   
Just Accepted Date: 23 March 2016   Online First Date: 14 September 2016    Issue Date: 11 January 2017
 Cite this article:   
Qi ZHU,Bo WU,Xipeng SHEN, et al. Understanding co-run performance on CPU-GPU integrated processors: observations, insights, directions[J]. Front. Comput. Sci., 2017, 11(1): 130-146.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-016-5468-8
https://academic.hep.com.cn/fcs/EN/Y2017/V11/I1/130