Front. Comput. Sci.    2022, Vol. 16 Issue (3) : 163104    https://doi.org/10.1007/s11704-020-9485-2
RESEARCH ARTICLE
Compressed page walk cache
Dunbo ZHANG, Chaoyang JIA, Li SHEN
School of Computer, National University of Defense Technology, Changsha 410000, China
Abstract

GPUs are widely used in modern high-performance computing systems. To reduce the burden on GPU programmers, the operating system and GPU hardware provide extensive support for shared virtual memory, which enables the GPU and CPU to share the same virtual address space. Unfortunately, the current SIMT execution model of GPUs poses great challenges for virtual-to-physical address translation on the GPU side, mainly because a huge number of virtual addresses are generated simultaneously and these addresses exhibit poor locality. The resulting excessive TLB accesses drive up the TLB miss ratio. As an attractive solution, the Page Walk Cache (PWC) has received wide attention for its ability to reduce the memory accesses caused by TLB misses.

However, the current PWC mechanism suffers from heavy redundancy, which significantly limits its efficiency. In this paper, we first investigate the factors leading to this issue by evaluating the performance of PWC with typical GPU benchmarks. We find that repeated L4 and L3 indices of virtual addresses increase the redundancy in the PWC, and the low locality of L2 indices causes a low PWC hit ratio. Based on these observations, we propose a new PWC structure, namely the Compressed Page Walk Cache (CPWC), to resolve the redundancy in the current PWC. CPWC can be organized in either direct-mapped or set-associative mode. Experimental results show that CPWC holds 3 times as many page table entries as TPC, improves the L2 index hit ratio by 38.3% over the conventional PWC, and reduces the number of page table memory accesses by 26.9%. The average number of memory accesses caused by each TLB miss is reduced to 1.13. Overall, the average IPC improves by 25.3%.
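To make the compression idea concrete, the following C sketch shows one way a direct-mapped CPWC-like structure can eliminate the repeated L4 and L3 indices: the few distinct L4 and L3 indices are held once in small side tables (the L4C and L3C of Tab.6), and each entry of the larger L2-indexed table stores only short pointers into them instead of the full 9-bit indices. The table sizes follow the configuration in Tab.6 (4-entry L4C, 8-entry L3C, 64-entry L2C); the field layout, names, and lookup logic are an illustrative assumption, not the exact hardware design evaluated in the paper.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch only: a software model of a direct-mapped,
 * CPWC-like structure.  Distinct L4/L3 indices are stored once in small
 * side tables; each L2C entry keeps 2-/3-bit pointers into them instead
 * of repeating the full 9-bit indices (sizes follow Tab.6). */
#define L4C_ENTRIES  4
#define L3C_ENTRIES  8
#define L2C_ENTRIES 64

struct l2c_entry {
    bool     valid;
    uint16_t l2_index;   /* 9-bit L2 index used as the tag            */
    uint8_t  l4_ptr;     /* 2-bit pointer into l4c[]                  */
    uint8_t  l3_ptr;     /* 3-bit pointer into l3c[]                  */
    uint64_t l1_base;    /* base address of the cached L1 page table  */
};

static uint16_t         l4c[L4C_ENTRIES];   /* distinct L4 indices */
static uint16_t         l3c[L3C_ENTRIES];   /* distinct L3 indices */
static struct l2c_entry l2c[L2C_ENTRIES];   /* direct-mapped L2C   */

/* On a TLB miss, probe the L2C with the L2 index; a hit also requires
 * that the compressed L4/L3 pointers resolve to the request's indices,
 * in which case the walk can skip directly to the L1 page table. */
bool cpwc_lookup(uint16_t l4, uint16_t l3, uint16_t l2, uint64_t *l1_base)
{
    struct l2c_entry *e = &l2c[l2 % L2C_ENTRIES];
    if (e->valid && e->l2_index == l2 &&
        l4c[e->l4_ptr] == l4 && l3c[e->l3_ptr] == l3) {
        *l1_base = e->l1_base;
        return true;
    }
    return false;   /* otherwise fall back to a full page walk */
}
```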

Keywords: GPU, shared virtual memory, address translation, PWC
Corresponding Author(s): Li SHEN   
Just Accepted Date: 27 September 2020   Issue Date: 09 November 2021
 Cite this article:   
Dunbo ZHANG, Chaoyang JIA, Li SHEN. Compressed page walk cache[J]. Front. Comput. Sci., 2022, 16(3): 163104.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-020-9485-2
https://academic.hep.com.cn/fcs/EN/Y2022/V16/I3/163104
Fig.1  The virtual-physical address translation process in a typical CPU-GPU architecture
Fig.2  The virtual address structure
Fig.3  A sampled snapshot of TPC entries during the execution of 2mm, showing redundancy in the L4 and L3 indices
No. Virtual address L4 index L3 index L2 index L1 index
1 0x7fd0c00714c0 0x0ff 0x143 0x000 0x071
2 0x7fd0c0072190 0x0ff 0x143 0x000 0x072
3 0x7fd0c02a8bf0 0x0ff 0x143 0x001 0x2a8
4 0x7fd10e0cdee0 0x0ff 0x144 0x040 0x0cd
5 0x7f734dc110df 0x0fe 0x1cd 0x06e 0x011
Tab.1  Virtual address list
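To show where the four indices in Tab.1 come from, the minimal C snippet below decomposes a 48-bit x86-64 virtual address according to the layout of Fig.2 (one 9-bit index per page-table level plus a 12-bit page offset); the address constant is the first entry of Tab.1, and the snippet is only an illustration of the standard decomposition.

```c
#include <stdint.h>
#include <stdio.h>

/* Extract the 9-bit page-table index of each level from a 48-bit
 * x86-64 virtual address: bits 47-39 (L4), 38-30 (L3), 29-21 (L2),
 * 20-12 (L1); bits 11-0 are the page offset. */
static unsigned idx(uint64_t va, int level)   /* level: 4 (top) .. 1 */
{
    int shift = 12 + 9 * (level - 1);
    return (unsigned)((va >> shift) & 0x1FF);
}

int main(void)
{
    uint64_t va = 0x7fd0c00714c0ULL;          /* row 1 of Tab.1 */
    printf("L4=0x%03x L3=0x%03x L2=0x%03x L1=0x%03x\n",
           idx(va, 4), idx(va, 3), idx(va, 2), idx(va, 1));
    /* Prints: L4=0x0ff L3=0x143 L2=0x000 L1=0x071 */
    return 0;
}
```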
Structure Space/bits Lookups Redundancy ratio/%
UPTC 1233 3 (Serial) 0
SPTC 1233 3 (Serial) 0
UTC 819 1 (Parallel) 55.5
STC 756 1 (Parallel) 51.9
TPC 876 1 (Parallel) 25
Tab.2  Space overhead, lookup method, and redundancy ratio of different PWC structures
Design S L2 DRAM
No PWC - 3.85 0.15
UPTC 2.95 0.99 0.14
SPTC 2.96 0.97 0.14
UTC 1.11 0.98 0.14
STC 1.09 0.97 0.14
TPC 1.09 0.97 0.14
Tab.3  The average number of caching structure accesses (S), L2 data cache hits (L2), and DRAM accesses (DRAM) per TLB miss for the various PWC designs over the SPEC CFP2006, SPECjbb2005 and Sweep3d benchmarks
Fig.4  Basic structure of the direct-mapped CPWC
Group Benchmark Utilization/%
I AES 50
STO 50
dwt2d 75
RAY 25
3DC 25
pathfinder 50
II MUM 25
2DC 25
streamcluster 50
backprop 25
hotspot 50
gemm 25
b+tree 25
2mm 25
3mm 25
BFS 25
III gramschmidt 25
Tab.4  The utilization of direct-mapped CPWC
Group Name L4 indices L3 indices L2 indices L2 index hit ratio/% Redundancy ratio/% Relative redundancy/%
I AES 1 2 326 43.09 62.5 93.75
STO 1 2 409 32.30 62.5 93.75
dwt2d 2 3 266 49.01 59.7 89.55
pathfinder 1 2 120 29.78 62.5 93.75
3DC 1 1 107 26.00 63.9 95.85
RAY 1 1 158 53.71 63.9 95.85
II MUM 1 1 154 68.46 63.9 95.85
2DC 1 1 121 28.73 63.9 95.85
streamcluster 1 2 81 49.35 62.5 93.75
backprop 1 1 81 33.03 63.9 95.85
hotspot 1 2 76 37.08 62.5 93.75
gemm 1 1 68 54.27 63.9 95.85
b+tree 1 1 66 42.07 63.9 95.85
2mm 1 1 66 48.55 63.9 95.85
3mm 1 1 66 44.12 63.9 95.85
BFS 1 1 64 52.71 63.9 95.85
III gramschmidt 1 1 39 97.77 63.9 95.85
Tab.5  Benchmarks, the L2 index hit ratios, and the redundancy of TPC
Fig.5  The basic structure of a set-associative CPWC
GPU core configurations
System overview 30 cores, 64 execution units per core, 8 memory partitions
Shader core config 1020 MHz, 9-stage pipeline, 64 threads per warp, GTO scheduler [15]
TLB 64 or 32 entries, fully associative, LRU replacement strategy
Page walk cache (TPC) 24 entries, 0.65 KB
Compressed page walk cache 64 entries, 0.63 KB; 4-entry L4C, 8-entry L3C, 64-entry L2C, 8 L2C blocks
Tab.6  The configurations of GPGPU-Sim
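For reference, the TLB row of Tab.6 (fully associative, LRU replacement, 64 entries in the larger configuration) behaves like the toy C model below; this is only an illustrative software sketch of the simulated parameters, not code taken from GPGPU-Sim.

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy software model of the simulated TLB in Tab.6: 64 entries,
 * fully associative, LRU replacement.  Purely illustrative. */
#define TLB_ENTRIES 64

struct tlb_entry {
    bool     valid;
    uint64_t vpn;        /* virtual page number (tag)  */
    uint64_t ppn;        /* physical page number       */
    uint64_t last_use;   /* access timestamp for LRU   */
};

static struct tlb_entry tlb[TLB_ENTRIES];
static uint64_t         now_tick;            /* global access counter */

/* Fully associative lookup: the VPN is compared against every entry. */
bool tlb_lookup(uint64_t vpn, uint64_t *ppn)
{
    now_tick++;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            tlb[i].last_use = now_tick;
            *ppn = tlb[i].ppn;
            return true;                     /* TLB hit */
        }
    }
    return false;                            /* TLB miss: start a page walk */
}

/* Fill a translation after a page walk, evicting the LRU entry. */
void tlb_fill(uint64_t vpn, uint64_t ppn)
{
    int victim = 0;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (!tlb[i].valid) { victim = i; break; }
        if (tlb[i].last_use < tlb[victim].last_use) victim = i;
    }
    tlb[victim] = (struct tlb_entry){ true, vpn, ppn, ++now_tick };
}
```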
Fig.6  IPC of TPC and CPWC
Fig.7  TLB hit ratios of benchmarks
Entries 8 24 48 64 128 256 512
TPC 1760 5280 10560 14080 28160 56320 112640
CPWC 1068 2252 4028 5212 9948 19420 38364
Capacity ratio (TPC/CPWC) 1.65 2.34 2.62 2.70 2.82 2.90 2.93
Redundancy in TPC 814 3182 6734 9102 18574 37518 75406
Redundancy in CPWC 0 0 0 0 0 0 0
Tab.7  The capacity (bits) of TPC and CPWC as the number of entries changes, and the capacity (bits) occupied by redundancy in TPC and CPWC
Capacity/bits 1760 5280 10560 14080 28160 56320 112640
TPC 8 24 48 64 128 256 512
CPWC 18 66 138 186 378 762 1530
Entry number ratio (TPC/CPWC) 0.44 0.36 0.35 0.34 0.33 0.33 0.33
Tab.8  The number of entries in TPC and CPWC as the capacity (bits) changes
Fig.8  The L2 index hit ratio
Group Benchmark TPC CPWC Reduction/%
I STO 29709 26865 9.57
AES 17173 15391 10.38
dwt2d 1796387 1425653 20.64
pathfinder 1243711 966372 22.30
3DC 8797879 6340701 27.93
RAY 81957 67920 17.13
II MUM 6268554 5148341 17.87
2DC 5266354 3456806 34.36
streamcluster 10057437 6805982 32.33
backprop 697374 466437 33.12
hotspot 130450 84991 34.85
gemm 1453235 999987 31.19
b+tree 717066 454270 36.65
2mm 55500198 36684374 33.90
3mm 5813135 3744513 35.59
BFS 121390 82467 32.06
III gramschmidt 106133656 103821379 2.18
Tab.9  Number of page table accesses with TPC and CPWC
Fig.9  Lifetime of the L2 indices. (a) 2DC; (b) MUM; (c) 3DC
Fig.10  The relationship between L2C entries and L2C hit ratio
1 Power J, Hill M D, Wood D A. Supporting x86-64 address translation for 100s of GPU lanes. In: Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture. 2014, 568−578
2 Chatterjee N, O'Connor M, Loh G H, Jayasena N, Balasubramonian R. Managing DRAM latency divergence in irregular GPGPU applications. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2014, 128−139
3 Burtscher M, Nasre R, Pingali K. A quantitative study of irregular programs on GPUs. In: Proceedings of IEEE International Symposium on Workload Characterization. 2012, 141−151
4 Meng J, Tarjan D, Skadron K. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In: Proceedings of the 37th International Symposium on Computer Architecture. 2010, 235−246
5 Vesely J, Basu A, Oskin M. Observations and opportunities in architecting shared virtual memory for heterogeneous systems. In: Proceedings of 2016 IEEE International Symposium on Performance Analysis of Systems and Software. 2016, 161−171
6 Bhattacharjee A. Large-reach memory management unit caches. In: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. 2013, 283−394
7 Shin S, Cox G, Oskin M, Loh G H, Solihin Y, Bhattacharjee A, Basu A. Scheduling page table walks for irregular GPU applications. In: Proceedings of the 45th ACM/IEEE Annual International Symposium on Computer Architecture. 2018, 180−192
8 Barr T W, Cox A L, Rixner S. Translation caching: skip, don’t walk (the page table). In: Proceedings of the 37th International Symposium on Computer Architecture. 2010, 48−59
9 Ausavarungnirun R, Landgraf J, Miller V, Ghose S, Gandhi J, Rossbach C J, Mutlu O. Mosaic: an application-transparent hardware-software cooperative memory manager for GPUs. arXiv preprint arXiv: 1804.11265, 2018
10 Ausavarungnirun R, Landgraf J, Miller V, Ghose S, Mutlu O. Mosaic: a GPU memory manager with application-transparent support for multiple page sizes. In: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. 2017, 136−150
11 Mei X, Chu X. Dissecting GPU memory hierarchy through microbenchmarking. IEEE Transactions on Parallel and Distributed Systems, 2016, 28(1): 72–86
12 Lee D, Subramanian L, Ausavarungnirun R, Choi J, Mutlu O. Decoupled direct memory access: isolating CPU and IO traffic by leveraging a dual-data-port DRAM. In: Proceedings of the International Conference on Parallel Architectures and Compilation. 2015, 174−187
13 Kurth A, Vogel P, Marongiu A, Benini L. Scalable and efficient virtual memory sharing in heterogeneous SOCs with TLB prefetching and MMU-aware DMA engine. In: Proceedings of the 36th IEEE International Conference on Computer Design. 2018, 292−300
14 Seznec A. Concurrent support of multiple page sizes on a skewed associative TLB. IEEE Transactions on Computers, 2004, 53(7): 924–927
15 Rogers T G, O'Connor M, Aamodt T M. Cache-conscious wavefront scheduling. In: Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture. 2012, 72−83
16 Bakhoda A, Yuan G L, Fung W W L, Wong H, Aamodt T M. Analyzing CUDA workloads using a detailed GPU simulator. In: Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software. 2009, 163−174
17 Che S, Boyer M, Meng J, Tarjan D, Skadron K. Rodinia: a benchmark suite for heterogeneous computing. In: Proceedings of the 2009 IEEE International Symposium on Workload Characterization. 2009, 44−54
18 Karimov J, Rabl T, Markl V. PolyBench: the first benchmark for polystores. In: Proceedings of the Technology Conference on Performance Evaluation and Benchmarking. 2018, 24−41
19 Basu A, Gandhi J, Chang J, Hill M D, Swift M M. Efficient virtual memory for big memory servers. In: Proceedings of the 40th Annual International Symposium on Computer Architecture. 2013, 237−248
20 Basu A. Revisiting virtual memory. University of Wisconsin at Madison, Dissertation, 2013
21 Gandhi J, Basu A, Hill M D, Swift M M. Efficient memory virtualization: reducing dimensionality of nested page walks. In: Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. 2014, 178−189
22 Pham B, Bhattacharjee A, Eckert Y, Loh G H. Increasing TLB reach by exploiting clustering in page translations. In: Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture. 2014, 558−567
23 Shin S, LeBeane M, Solihin Y, Basu A. Neighborhood-aware address translation for irregular GPU applications. In: Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture. 2018, 252−363