Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2018, Vol. 12 Issue (6) : 1090-1104    https://doi.org/10.1007/s11704-017-6349-5
RESEARCH ARTICLE
HSCS: a hybrid shared cache scheduling scheme for multiprogrammed workloads
Jingyu ZHANG1,2, Chentao WU1, Dingyu YANG1,3, Yuanyi CHEN1, Xiaodong MENG1, Liting XU1, Minyi GUO1
1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
2. Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha 410004, China
3. School of Electronics and Information, Shanghai Dianji University, Shanghai 200240, China
Abstract

Modern 3D-stacking technology allows dynamic random-access memory (DRAM) to be integrated on chip, where it can serve as a shared cache in multicore systems. Compared with static random-access memory (SRAM), DRAM is larger but slower. Much existing work aims to improve workload performance by using SRAM and stacked DRAM together in shared cache systems, with approaches ranging from improving the SRAM structure to optimizing cache tag and data access. However, little attention has been paid to designing a shared cache scheduling scheme for multiprogrammed workloads with different memory footprints in multicore systems. Motivated by this, we propose a hybrid shared cache scheduling scheme that allows a multicore system to use SRAM and 3D-stacked DRAM efficiently and thus achieve better workload performance. The scheme employs (1) a cache monitor, which collects cache statistics; (2) a cache evaluator, which evaluates the cache information while programs execute; and (3) a cache switcher, which self-adaptively chooses between the SRAM and DRAM shared cache modules. A cache data migration policy is also developed to guarantee that the scheduling scheme works correctly. Extensive experiments evaluate the workload performance of the proposed scheme. The experimental results show that our method improves multiprogrammed workload performance by up to 25% compared with state-of-the-art methods (including conventional and DRAM cache systems).
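
The monitor, evaluator, and switcher pipeline described in the abstract can be pictured with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the class and method names, the epoch-based sampling, and in particular the decision rule (prefer SRAM when the number of distinct cache lines touched in an epoch fits within a hypothetical SRAM capacity, otherwise prefer DRAM). The abstract does not state the actual evaluation criteria or the migration mechanism, so migration is represented only by a counter.

```python
from dataclasses import dataclass


@dataclass
class EpochStats:
    accesses: int = 0
    misses: int = 0
    footprint_lines: int = 0  # distinct cache lines touched in the epoch


class CacheMonitor:
    """Collects per-epoch cache statistics (hypothetical interface)."""

    def __init__(self) -> None:
        self._seen = set()
        self._accesses = 0
        self._misses = 0

    def record(self, line_addr: int, hit: bool) -> None:
        self._accesses += 1
        if not hit:
            self._misses += 1
        self._seen.add(line_addr)

    def snapshot_and_reset(self) -> EpochStats:
        stats = EpochStats(self._accesses, self._misses, len(self._seen))
        self._seen = set()
        self._accesses = 0
        self._misses = 0
        return stats


class CacheEvaluator:
    """Turns statistics into a preferred module.

    Assumed rule: a footprint that fits in SRAM prefers SRAM, else DRAM.
    """

    def __init__(self, sram_capacity_lines: int) -> None:
        self.sram_capacity_lines = sram_capacity_lines

    def preferred_module(self, stats: EpochStats) -> str:
        if stats.footprint_lines <= self.sram_capacity_lines:
            return "SRAM"
        return "DRAM"


class CacheSwitcher:
    """Tracks the active shared cache module and counts switches."""

    def __init__(self, initial: str = "SRAM") -> None:
        self.active = initial
        self.migrations = 0  # stand-in for the data migration policy

    def apply(self, preferred: str) -> bool:
        """Switch modules if needed; return True when a switch happened."""
        if preferred == self.active:
            return False
        self.active = preferred
        self.migrations += 1
        return True
```

In this sketch, a caller would feed each access to `CacheMonitor.record`, and at the end of an epoch pass `snapshot_and_reset()` through `CacheEvaluator.preferred_module` into `CacheSwitcher.apply`. A real design would also need hysteresis to avoid switching back and forth on borderline footprints.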

Keywords: multicore system, shared cache, workload performance
Corresponding Author(s): Dingyu YANG, Minyi GUO
Just Accepted Date: 16 May 2017   Online First Date: 08 June 2018    Issue Date: 04 December 2018
 Cite this article:   
Jingyu ZHANG, Chentao WU, Dingyu YANG, et al. HSCS: a hybrid shared cache scheduling scheme for multiprogrammed workloads[J]. Front. Comput. Sci., 2018, 12(6): 1090-1104.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-017-6349-5
https://academic.hep.com.cn/fcs/EN/Y2018/V12/I6/1090