Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

邮发代号 80-970

2019 Impact Factor: 1.275

Frontiers of Computer Science  2025, Vol. 19 Issue (5): 195107   https://doi.org/10.1007/s11704-023-3566-y
  本期目录
MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency
Yixuan DU1, Edwin Hsing-Mean SHA1, Yuhong SONG1, Yibo GUO2, Longshan XU1, Qingfeng ZHUGE1()
1. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
2. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
 全文: PDF(5990 KB)   HTML
Abstract

Scratch-pad memory (SPM) has been widely used in embedded systems because it allows software-controlled data placement. By designing data placement strategies, optimal solutions with minimal memory access latency for loops on SPM-DRAM architecture can be explored. Although existing works effectively reduce the latency by using fine-grained data placement methods, they fail in solving the case of inconsecutive array access. Meanwhile, fine-grained strategy can lead to excessive memory activation overhead, making it less efficient. Therefore, in this paper, we first propose a fine-grained dynamic programming algorithm, called FiDP, to tackle unsolved case and minimize latency. In order to mitigate the frequent activation before data access, we then add a medium-grained scheme to our strategy. It can achieve a better solution than FiDP by strictly formulating an integer linear programming (ILP) problem and considering multiple granularities, which is called MuILP. Furthermore, to compensate for the high time complexity of ILP, we develop a heuristic multi-granularity data placement algorithm, called HMuDP, which achieves a near-optimal solution with lower complexity. Experimental results show that our FiDP reduces the total latency by 75.90%, 47.70% and 12.34% compared with LRU-cache, a greedy-based comparison method (called Uday) and a dynamic programming-based comparison method (called DLAA). Besides, our MuILP and HMuDP yield less latency than FiDP with 45.10% and 43.14% average improvement, respectively.

Key wordsscratch-pad memory    data placement    loops    embedded system
收稿日期: 2023-07-14      出版日期: 2024-07-10
Corresponding Author(s): Qingfeng ZHUGE   
 引用本文:   
. [J]. Frontiers of Computer Science, 2025, 19(5): 195107.
Yixuan DU, Edwin Hsing-Mean SHA, Yuhong SONG, Yibo GUO, Longshan XU, Qingfeng ZHUGE. MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency. Front. Comput. Sci., 2025, 19(5): 195107.
 链接本文:  
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-023-3566-y
https://academic.hep.com.cn/fcs/CN/Y2025/V19/I5/195107
Fig.1  
Fig.2  
Variable Read Write
A[i+1] 2 2
A[i] 2 1
A[i?1] 1 0
B 3 1
C[i] 0 1
C[i?2] 2 0
C[i?4] 2 0
D[i] 0 1
Tab.1  
Operation Latency Energy
read/write one unit in SPM 1 1
read/write one unit in main memory 50 300
move one unit between SPM and MM 51 301
move one block between SPM and MM 74 412
Tab.2  
Variable Before Placement strategy
A[i+1] MM MM→SPM, access, remain in SPM
A[i] SPM access, remain in SPM
A[i?1] SPM access, SPM→MM
B SPM access, remain in SPM
C[i] MM access, remain in MM
C[i?2] MM MM→SPM, access, remain in SPM
C[i?4] SPM access, SPM→MM
D[i] MM access, remain in MM
Tab.3  
Fig.3  
Fig.4  
  
Component Specification
CPU Number of cores: 1, frequency: 1.0 GHz
on-chip SPM 2 KB SRAM, read/write latency: 0.289 ns, read/write energy: 0.011 nJ
Main memory 512 MB DDR SDRAM, read/write latency: 12.249 ns, read/write energy: 3.24 nJ
Tab.4  
Fig.5  
Fig.6  
Fig.7  
Fig.8  
Fig.9  
Fig.10  
Benchmark #Arrays Average access of array elements Optimal block size (word)
firlms2 26 1.5 8
w_vec 48 1 5
dotprod 42 1 5
autocor 15 2 6
fft_bit_reduct 7 12 11
all_pole 8 5 10
lattice 6 1 8
iir 9 4 10
conv 9 3 9
pool 15 3 9
Tab.5  
  
  
  
  
  
  
1 A, VanHattum R, Nigam V T, Lee J, Bornholt A Sampson . A synthesis-aided compiler for DSP architectures (WiP Paper). In: Proceedings of the 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems. 2020, 131−135
2 Y, Song W, Jiang B, Li P, Qi Q, Zhuge E H M, Sha S, Dasgupta Y, Shi C Ding . Dancing along battery: enabling transformer with run-time reconfigurability on mobile devices. In: Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC). 2021, 1003−1008
3 X H, Sun L M Ni . Another view on parallel speedup. In: Proceedings of 1990 ACM/IEEE conference on Supercomputing. 1990, 324−333
4 M, Wu Y, Liu H, Cui Q, Wei Q, Li L, Li F, Lv J, Xue X Feng . Bandwidth-aware loop tiling for DMA-supported scratchpad memory. In: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques. 2020, 97−109
5 V, Venkataramani M C, Chan T Mitra . Scratchpad-memory management for multi-threaded applications on many-core architectures. ACM Transactions on Embedded Computing Systems, 2019, 18( 1): 10
6 Y, Guo Q, Zhuge J, Hu J, Yi M, Qiu E H M Sha . Data placement and duplication for embedded multicore systems with scratch pad memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2013, 32( 6): 809–817
7 Y, Guo Q, Zhuge J, Zhang J, Hu E H M Sha . Optimal data allocation algorithm for loop-centric applications on scratch-PAD memories. In: Proceedings of 2013 SiPS. 2013, 383−388
8 A, Singh S, Dave P, Zardoshti R, Brotzman C, Zhang X, Guo A, Shrivastava G, Tan M Spear . SPX64: a scratchpad memory for general-purpose microprocessors. ACM Transactions on Architecture and Code Optimization, 2021, 18( 1): 14
9 A E Şuşu . A vector-length agnostic compiler for the Connex-S accelerator with scratchpad memory. ACM Transactions on Embedded Computing Systems, 2020, 19( 6): 51
10 H J, Jeong J, Yeo C, Bahk J Park . Pin or fuse? Exploiting scratchpad memory to reduce off-chip data transfer in DNN accelerators. In: Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization. 2023, 224−235
11 R, Xu E H M, Sha Q, Zhuge Y, Song H Wang . Loop interchange and tiling for multi-dimensional loops to minimize write operations on NVMs. Journal of Systems Architecture, 2023, 135: 102799
12 S, Udayakumaran A, Dominguez R Barua . Dynamic allocation for scratch-pad memory using compile-time decisions. ACM Transactions on Embedded Computing Systems, 2006, 5( 2): 472–511
13 J, Li Z, Zhang M, Qiu P, Zhang G, Quan Y Zhu . Optimizing scheduling in embedded CMP systems with phase change memory. In: Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems. 2012, 532−539
14 Y, Li J, Zhan W, Jiang Y Li . Writing-aware data variable allocation on hybrid SRAM+NVM SPM: work-in-progress. In: Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES). 2018, 11
15 Y, Li J, Zhan W, Jiang J Yu . Energy optimization of branch-aware data variable allocation on hybrid SRAM+NVM SPM for CPS. In: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing. 2019, 236−241
16 J, Zhan Y, Li W, Jiang J, Yu J Yu . Branch-aware data variable allocation for energy optimization of hybrid SRAM+NVM SPM. Journal of Systems Architecture, 2020, 109: 101797
17 R, Xu E H M, Sha Q, Zhuge Y, Song H, Wang L Shi . Optimizing data placement for hybrid SRAM+racetrack memory SPM in embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42( 3): 847–859
18 Q, Zhuge Y, Guo J, Hu W C, Tseng C J, Xue E H M Sha . Minimizing access cost for multiple types of memory units in embedded systems through data allocation and scheduling. IEEE Transactions on Signal Processing, 2012, 60( 6): 3253–3263
19 V, Suhendra C, Raghavan T Mitra . Integrated scratchpad memory optimization and task scheduling for MPSoC architectures. In: Proceedings of 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems. 2006, 401−410
20 G, Wang L, Ju Z, Jia X Li . Data allocation for embedded systems with hybrid on-chip scratchpad and caches. In: Proceedings of the 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing. 2013, 366−373
21 N, Muralimanohar R, Balasubramonian N Jouppi . Optimizing nuca organizations and wiring alternatives for large caches with cacti 6.0. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007). 2007, 3–14
[1] FCS-23566-OF-YD_suppl_1 Download
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed