CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs

doi:10.1631/FITEE.1700059

Frontiers of Information Technology & Electronic Engineering

2018, Vol. 19, Issue 2: 206-220 https://doi.org/10.1631/FITEE.1700059


Yang ZHANG, Zuo-cheng XING, Cang LIU, Chuan TANG

National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China


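Abstract (translated from Chinese):

As we approach the era of exascale supercomputers, a balanced computer system that combines powerful computing capability with low energy consumption is becoming increasingly important. GPUs are accelerators widely used in recently deployed supercomputers. They employ massive multithreading to hide long memory access latency while achieving high energy efficiency. Relative to their powerful computing capability, GPUs have only a few megabytes of on-chip resources per streaming multiprocessor. The mismatch between the throughput-oriented execution model and the cache hierarchy design makes GPU caches inefficient. Because on-chip memory is severely scarce, poor cache performance sharply reduces the GPU's computing capability and limits system performance and energy efficiency. We propose a coordinated warp scheduling and locality-protected cache allocation scheme (CWLP) to make full use of data locality and hide latency. First, a locality protection method based on the instruction PC (LPC) is designed to improve GPU performance. A PC-based collector gathers the reuse information of each cache block. After the dynamic reuse information of the cache blocks is obtained, an intelligent cache allocation unit (PCAU) combines the reuse information with the LRU (least recently used) replacement policy to find and evict the cache block with the least locality. In addition, the locality information is used by the warp scheduler to implement an intelligent reordering scheme that captures locality and hides latency. Experimental results show that CWLP provides a speedup of up to 19.8% and an average performance improvement of 8.8% over the baseline policies.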

Abstract:

As we approach the exascale era in supercomputing, designing a balanced computer system with powerful computing ability and low power requirements has become increasingly important. The graphics processing unit (GPU) is an accelerator used widely in most recent supercomputers. It adopts a large number of threads to hide long latency with high energy efficiency. In contrast to their powerful computing ability, GPUs have only a few megabytes of fast on-chip memory storage per streaming multiprocessor (SM). The GPU cache is inefficient due to a mismatch between the throughput-oriented execution model and the cache hierarchy design. At the same time, current GPUs fail to handle burst-mode long-access latency because of their poor warp scheduling method. Thus, the benefits of the GPU's high computing ability are reduced dramatically by poor cache management and warp scheduling, which limit system performance and energy efficiency. In this paper, we put forward a coordinated warp scheduling and locality-protected (CWLP) cache allocation scheme to make full use of data locality and hide latency. We first present a locality-protected cache allocation method based on the instruction program counter (LPC) to promote cache performance. Specifically, we use a PC-based locality detector to collect the reuse information of each cache line and employ a prioritised cache allocation unit (PCAU), which coordinates the data reuse information with the time-stamp information to evict the lines with the least reuse possibility. Moreover, the locality information is used by the warp scheduler to create an intelligent warp reordering scheme to capture locality and hide latency. Simulation results show that CWLP provides a speedup of up to 19.8% and an average improvement of 8.8% over the baseline methods.
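The eviction idea described above (PC-based reuse information combined with time stamps) can be illustrated with a minimal sketch. It is not the paper's design: the abstract gives no hardware details, so the saturating counters, the training rule, the tie-breaking order, and every name here (`PCReuseTable`, `CacheSet`, `CacheLine`) are assumptions made for illustration only. Warp reordering is not modelled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CacheLine:
    tag: int
    pc: int            # PC of the load that filled this line
    last_use: int      # time stamp of the most recent access
    reused: bool = False


class PCReuseTable:
    """Hypothetical PC-indexed saturating counters that track line reuse."""

    def __init__(self, max_count=3):
        self.max_count = max_count
        self.counters = defaultdict(int)

    def train(self, pc, reused):
        # Reward PCs whose lines were hit again; penalise lines evicted unused.
        delta = 1 if reused else -1
        self.counters[pc] = min(self.max_count, max(0, self.counters[pc] + delta))

    def score(self, pc):
        return self.counters[pc]


class CacheSet:
    """One cache set whose victim choice mixes predicted reuse with LRU."""

    def __init__(self, ways, table):
        self.ways = ways
        self.table = table
        self.lines = []
        self.clock = 0

    def access(self, tag, pc):
        """Return True on a hit, False on a miss (the line is then filled)."""
        self.clock += 1
        for line in self.lines:
            if line.tag == tag:
                line.last_use = self.clock
                if not line.reused:
                    line.reused = True
                    self.table.train(line.pc, reused=True)
                return True
        if len(self.lines) >= self.ways:
            self._evict()
        self.lines.append(CacheLine(tag, pc, self.clock))
        return False

    def _evict(self):
        # Lowest predicted reuse goes first; ties fall back to the oldest
        # time stamp, i.e. plain LRU.
        victim = min(self.lines,
                     key=lambda ln: (self.table.score(ln.pc), ln.last_use))
        if not victim.reused:
            self.table.train(victim.pc, reused=False)
        self.lines.remove(victim)
```

In a 2-way set, a line filled by a PC with observed reuse survives a stream of single-use fills from another PC, whereas plain LRU would evict it once two newer lines arrive.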

Key words: Locality; Graphics processing unit (GPU); Cache allocation; Warp scheduling

Received: 2017-01-19 Published: 2018-04-23

Corresponding author: Yang ZHANG, E-mail: zhangyang@nudt.edu.cn

Cite this article:

Yang ZHANG, Zuo-cheng XING, Cang LIU, Chuan TANG. CWLP: coordinated warp scheduling and locality-protected cache allocation on GPUs. Front. Inform. Technol. Electron. Eng, 2018, 19(2): 206-220.

Link to this article:

https://academic.hep.com.cn/fitee/CN/10.1631/FITEE.1700059
https://academic.hep.com.cn/fitee/CN/Y2018/V19/I2/206
