Frontiers of Computer Science


Front. Comput. Sci.    2024, Vol. 18 Issue (2) : 182103    https://doi.org/10.1007/s11704-023-2675-y
Architecture
A hybrid memory architecture supporting fine-grained data migration
Ye CHI1, Jianhui YUE2, Xiaofei LIAO1, Haikun LIU1, Hai JIN1
1. National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
2. Department of Computer Science, Michigan Technological University, Michigan 49931, USA
Abstract

Hybrid memory systems composed of dynamic random access memory (DRAM) and non-volatile memory (NVM) often exploit page migration technologies to take full advantage of the different memory media. Most previous proposals migrate data at a granularity of 4 KB pages, and thus waste memory bandwidth and DRAM resources. In this paper, we propose Mocha, a non-hierarchical architecture that organizes DRAM and NVM in a flat physical address space but manages them as a cache/memory hierarchy. Since the commercial NVM device, Intel Optane DC Persistent Memory Module (DCPMM), actually accesses the physical media at a granularity of 256 bytes (an Optane block), we manage the DRAM cache at a 256-byte granularity to match this feature of Optane. This design not only enables fine-grained data migration and management for the DRAM cache, but also avoids write amplification on Intel Optane DCPMM. We also place an Indirect Address Cache (IAC) in the Hybrid Memory Controller (HMC) and propose a reverse address mapping table in DRAM to speed up address translation and cache replacement. Moreover, we exploit a utility-based caching mechanism to filter cold blocks in the NVM and further improve the efficiency of the DRAM cache. We implement Mocha in an architectural simulator. Experimental results show that Mocha improves application performance by 8.2% on average (up to 24.6%), and reduces energy consumption by 6.9% and data migration traffic by 25.9% on average, compared with a typical hybrid memory architecture, HSCC.
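
As a rough illustration of the fine-grained caching idea described in the abstract, the sketch below models a DRAM cache managed in 256-byte blocks (matching the Optane media access granularity), with a small LRU tag cache standing in for the IAC and a simple utility threshold filtering cold NVM blocks before admission. This is a minimal sketch, not the paper's implementation: all names (BLOCK_SIZE, DramBlockCache, ADMIT_THRESHOLD) and the admission threshold value are illustrative assumptions.

# Illustrative sketch only: a 256-byte-granularity DRAM cache with a small
# tag cache (IAC stand-in) and a utility-based admission filter.
from collections import OrderedDict

BLOCK_SIZE = 256          # Optane media access granularity (bytes)
ADMIT_THRESHOLD = 2       # admit an NVM block after this many recent misses (assumed value)

class DramBlockCache:
    def __init__(self, capacity_blocks, tag_cache_entries=1024):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()        # NVM block id -> cached in DRAM (LRU order)
        self.tag_cache = OrderedDict()     # small on-chip subset of tags (IAC stand-in)
        self.tag_cache_entries = tag_cache_entries
        self.access_counts = {}            # utility counters for cold-block filtering

    def _touch_tag_cache(self, block_id, present):
        # Keep the most recently used tags; evict the oldest when full.
        self.tag_cache[block_id] = present
        self.tag_cache.move_to_end(block_id)
        if len(self.tag_cache) > self.tag_cache_entries:
            self.tag_cache.popitem(last=False)

    def access(self, addr):
        """Return 'dram' on a cache hit, 'nvm' when the request is served from NVM."""
        block_id = addr // BLOCK_SIZE
        hit = block_id in self.blocks
        self._touch_tag_cache(block_id, hit)
        if hit:
            self.blocks.move_to_end(block_id)
            return "dram"
        # Utility filter: only migrate blocks that have shown some reuse.
        self.access_counts[block_id] = self.access_counts.get(block_id, 0) + 1
        if self.access_counts[block_id] >= ADMIT_THRESHOLD:
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)   # evict LRU block: 256 B moved, not 4 KB
            self.blocks[block_id] = True
        return "nvm"

# Example use: cache = DramBlockCache(capacity_blocks=4096); cache.access(0x1234)

Because a fill or eviction moves a single 256-byte block rather than a 4 KB page, only data that was actually touched crosses the memory bus, which is where the bandwidth and write-amplification savings come from.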

Keywords: non-volatile memory; hybrid memory system; data migration; fine-grained caching
Corresponding Author(s): Xiaofei LIAO   
Just Accepted Date: 02 February 2023   Issue Date: 10 April 2023
 Cite this article:   
Ye CHI, Jianhui YUE, Xiaofei LIAO, et al. A hybrid memory architecture supporting fine-grained data migration[J]. Front. Comput. Sci., 2024, 18(2): 182103.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2675-y
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I2/182103
Fig.1  Traditional hybrid memory architectures. (a) Hierarchical structure; (b) Flat structure
Workloads    Hot ratio/%
omnetpp      45.0
astar        23.8
bzip2        21.9
soplex       20.0
xalan        30.6
Mean         28.1
Tab.1  Hot subpage ratio
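
The ratios in Tab. 1 indicate that only a fraction of the 256-byte subpages in each touched 4 KB page are hot, which motivates managing the DRAM cache at subpage granularity. Below is a hedged sketch of how such a ratio could be estimated from a flat address trace; the trace format and the hotness threshold are assumptions, not the methodology used in the paper.

# Illustrative only: fraction of "hot" 256-byte subpages among all subpages of
# touched 4 KB pages, given a list of byte addresses. hot_threshold is assumed.
from collections import Counter

PAGE_SIZE, SUBPAGE_SIZE = 4096, 256

def hot_subpage_ratio(addresses, hot_threshold=8):
    subpage_hits = Counter(addr // SUBPAGE_SIZE for addr in addresses)
    subpages_per_page = PAGE_SIZE // SUBPAGE_SIZE
    touched_pages = {sp // subpages_per_page for sp in subpage_hits}
    hot = sum(1 for n in subpage_hits.values() if n >= hot_threshold)
    total = len(touched_pages) * subpages_per_page
    return 100.0 * hot / total if total else 0.0   # percentage, as in Tab. 1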
Fig.2  Architecture of Mocha
Fig.3  Memory access flow after LLC miss
Fig.4  Structure of reverse mapping table and IAC entry. (a) Reverse mapping table; (b) IAC entry
Fig.5  Some parameters in different time slots
Configuration
CPU          3.2 GHz, in-order, 8 cores
L1 Cache     4-way, split D/I, 3-cycle latency, private 64 KB per core
L2 Cache     8-way, 10-cycle latency, private 256 KB per core
L3 Cache     16-way, 34-cycle latency, shared 8 MB
IAC          4-way, 1-cycle latency, private 10 KB per core
DRAM         1 GB (with 20 MB reverse mapping table): 1 channel, 1 rank, 8 banks, 64 cols, 32768 rows
             Timing (tCAS-tRCD-tRP-tRAS): 7-7-7-18 (cycles), 13.5 ns read latency, 28.5 ns write latency
             Bandwidth: 10.7 GB/s, FR-FCFS request scheduling
PCM          32 GB: 4 channels, 8 ranks, 8 banks/rank, 65536 rows, 32 cols
             Timing (tCAS-tRCD-tRP-tRAS): 9-37-100-53 (cycles), 19.5 ns read latency, 171 ns write latency
             Bandwidth: 10.7 GB/s, FR-FCFS request scheduling
Energy consumption
DRAM         Voltage: 1.5 V, Standby: 77 mA, Precharge: 37 mA, Refresh: 160 mA
             Read and write (row buffer hit): 120 mA and 125 mA
             Read and write (row buffer miss): 237 mA and 242 mA
PCM          Read/write (row buffer hit): 1.616 pJ/bit
             Read and write (row buffer miss): 81.2 pJ/bit and 1684.8 pJ/bit
Tab.2  Detailed system configuration
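
The DRAM and PCM latencies in Tab. 2 also allow a back-of-envelope estimate of when migrating a 256-byte block from PCM to DRAM pays off. The simple break-even model below is an illustration under assumed 64 B bus transfers; it ignores queuing, row-buffer effects, and bandwidth contention, and is not the cost model used in the paper.

# Back-of-envelope use of the Tab. 2 read/write latencies (illustrative model only).
DRAM_READ_NS, DRAM_WRITE_NS = 13.5, 28.5
PCM_READ_NS = 19.5

def migration_breakeven_reads(block_bytes=256, bus_bytes=64):
    """Approximate number of future reads needed before a PCM->DRAM migration pays off."""
    bursts = block_bytes // bus_bytes                        # 64 B transfers per 256 B block
    migrate_cost = bursts * (PCM_READ_NS + DRAM_WRITE_NS)    # read from PCM, write into DRAM
    saving_per_read = PCM_READ_NS - DRAM_READ_NS             # latency saved per later 64 B read
    return migrate_cost / saving_per_read

print(f"break-even after ~{migration_breakeven_reads():.0f} subsequent 64 B reads")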
Workloads        Applications
SPEC CPU 2006    mcf, milc, lbm, sphinx3, astar, omnetpp, bzip2, cactusADM, wrf
Parsec           Canneal, bodytrack
PBBS             BFS, ISORT, DICT
MIX1             milc+lbm+mcf+astar
MIX2             sphinx3+astar+omnetpp+bzip2
MIX3             bzip2+lbm+cactusADM+BFS
MIX4             milc+lbm+astar+DICT+ISORT+Canneal
MIX5             mcf+milc+sphinx3+bzip2+DICT+bodytrack
Tab.3  Workloads
Fig.6  Normalized IPCs relative to the baseline
Fig.7  DRAM utilization of different workloads
Fig.8  MPKI of different workloads
Fig.9  The IAC hit rate varies with its capacity
Fig.10  Energy consumption normalized to baseline system
Fig.11  Normalized data migration traffic