MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency

doi:10.1007/s11704-023-3566-y

Frontiers of Computer Science

2025, Vol. 19

Issue (5): 195107 https://doi.org/10.1007/s11704-023-3566-y

本期目录

MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency

Yixuan DU¹, Edwin Hsing-Mean SHA¹, Yuhong SONG¹, Yibo GUO², Longshan XU¹, Qingfeng ZHUGE¹(

)

¹. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
². School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China

全文: PDF(5990 KB) HTML

Abstract：

Scratch-pad memory (SPM) has been widely used in embedded systems because it allows software-controlled data placement. By designing data placement strategies, optimal solutions with minimal memory access latency for loops on SPM-DRAM architecture can be explored. Although existing works effectively reduce the latency by using fine-grained data placement methods, they fail in solving the case of inconsecutive array access. Meanwhile, fine-grained strategy can lead to excessive memory activation overhead, making it less efficient. Therefore, in this paper, we first propose a fine-grained dynamic programming algorithm, called FiDP, to tackle unsolved case and minimize latency. In order to mitigate the frequent activation before data access, we then add a medium-grained scheme to our strategy. It can achieve a better solution than FiDP by strictly formulating an integer linear programming (ILP) problem and considering multiple granularities, which is called MuILP. Furthermore, to compensate for the high time complexity of ILP, we develop a heuristic multi-granularity data placement algorithm, called HMuDP, which achieves a near-optimal solution with lower complexity. Experimental results show that our FiDP reduces the total latency by 75.90%, 47.70% and 12.34% compared with LRU-cache, a greedy-based comparison method (called Uday) and a dynamic programming-based comparison method (called DLAA). Besides, our MuILP and HMuDP yield less latency than FiDP with 45.10% and 43.14% average improvement, respectively.

Key words： scratch-pad memory data placement loops embedded system

收稿日期: 2023-07-14 出版日期: 2024-07-10

Corresponding Author(s): Qingfeng ZHUGE

引用本文:

. [J]. Frontiers of Computer Science, 2025, 19(5): 195107.
Yixuan DU, Edwin Hsing-Mean SHA, Yuhong SONG, Yibo GUO, Longshan XU, Qingfeng ZHUGE. MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency. Front. Comput. Sci., 2025, 19(5): 195107.

链接本文:

https://academic.hep.com.cn/fcs/CN/10.1007/s11704-023-3566-y
https://academic.hep.com.cn/fcs/CN/Y2025/V19/I5/195107

Fig.1

Fig.2

Tab.1

Tab.2

Tab.3

Fig.3

Fig.4

Tab.4

Fig.5

Fig.6

Fig.7

Fig.8

Fig.9

Fig.10

Tab.5

1	A, VanHattum R, Nigam V T, Lee J, Bornholt A Sampson . A synthesis-aided compiler for DSP architectures (WiP Paper). In: Proceedings of the 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. 2020, 131−135
2	Y, Song W, Jiang B, Li P, Qi Q, Zhuge E H M, Sha S, Dasgupta Y, Shi C Ding . Dancing along battery: enabling transformer with run-time reconfigurability on mobile devices. In: Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC). 2021, 1003−1008
3	X H, Sun L M Ni . Another view on parallel speedup. In: Proceedings of 1990 ACM/IEEE conference on Supercomputing. 1990, 324−333
4	M, Wu Y, Liu H, Cui Q, Wei Q, Li L, Li F, Lv J, Xue X Feng . Bandwidth-aware loop tiling for DMA-supported scratchpad memory. In: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. 2020, 97−109
5	V, Venkataramani M C, Chan T Mitra . Scratchpad-memory management for multi-threaded applications on many-core architectures. ACM Transactions on Embedded Computing Systems, 2019, 18( 1): 10
6	Y, Guo Q, Zhuge J, Hu J, Yi M, Qiu E H M Sha . Data placement and duplication for embedded multicore systems with scratch pad memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2013, 32( 6): 809–817
7	Y, Guo Q, Zhuge J, Zhang J, Hu E H M Sha . Optimal data allocation algorithm for loop-centric applications on scratch-PAD memories. In: Proceedings of 2013 SiPS. 2013, 383−388
8	A, Singh S, Dave P, Zardoshti R, Brotzman C, Zhang X, Guo A, Shrivastava G, Tan M Spear . SPX64: a scratchpad memory for general-purpose microprocessors. ACM Transactions on Architecture and Code Optimization, 2021, 18( 1): 14
9	A E Şuşu . A vector-length agnostic compiler for the Connex-S accelerator with scratchpad memory. ACM Transactions on Embedded Computing Systems, 2020, 19( 6): 51
10	H J, Jeong J, Yeo C, Bahk J Park . Pin or fuse? Exploiting scratchpad memory to reduce off-chip data transfer in DNN accelerators. In: Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization. 2023, 224−235
11	R, Xu E H M, Sha Q, Zhuge Y, Song H Wang . Loop interchange and tiling for multi-dimensional loops to minimize write operations on NVMs. Journal of Systems Architecture, 2023, 135: 102799
12	S, Udayakumaran A, Dominguez R Barua . Dynamic allocation for scratch-pad memory using compile-time decisions. ACM Transactions on Embedded Computing Systems, 2006, 5( 2): 472–511
13	J, Li Z, Zhang M, Qiu P, Zhang G, Quan Y Zhu . Optimizing scheduling in embedded CMP systems with phase change memory. In: Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems. 2012, 532−539
14	Y, Li J, Zhan W, Jiang Y Li . Writing-aware data variable allocation on hybrid SRAM+NVM SPM: work-in-progress. In: Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES). 2018, 11
15	Y, Li J, Zhan W, Jiang J Yu . Energy optimization of branch-aware data variable allocation on hybrid SRAM+NVM SPM for CPS. In: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. 2019, 236−241
16	J, Zhan Y, Li W, Jiang J, Yu J Yu . Branch-aware data variable allocation for energy optimization of hybrid SRAM+NVM SPM. Journal of Systems Architecture, 2020, 109: 101797
17	R, Xu E H M, Sha Q, Zhuge Y, Song H, Wang L Shi . Optimizing data placement for hybrid SRAM+racetrack memory SPM in embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42( 3): 847–859
18	Q, Zhuge Y, Guo J, Hu W C, Tseng C J, Xue E H M Sha . Minimizing access cost for multiple types of memory units in embedded systems through data allocation and scheduling. IEEE Transactions on Signal Processing, 2012, 60( 6): 3253–3263
19	V, Suhendra C, Raghavan T Mitra . Integrated scratchpad memory optimization and task scheduling for MPSoC architectures. In: Proceedings of 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems. 2006, 401−410
20	G, Wang L, Ju Z, Jia X Li . Data allocation for embedded systems with hybrid on-chip scratchpad and caches. In: Proceedings of the 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing. 2013, 366−373
21	N, Muralimanohar R, Balasubramonian N Jouppi . Optimizing nuca organizations and wiring alternatives for large caches with cacti 6.0. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007). 2007, 3–14

[1]

FCS-23566-OF-YD_suppl_1

Download

Viewed

Full text

Abstract

Cited

Shared

Discussed