Frontiers of Computer Science

Front. Comput. Sci.    2024, Vol. 18 Issue (5) : 185607    https://doi.org/10.1007/s11704-023-3397-x
RESEARCH ARTICLE
ARCHER: a ReRAM-based accelerator for compressed recommendation systems
Xinyang SHEN, Xiaofei LIAO, Long ZHENG, Yu HUANG, Dan CHEN, Hai JIN
National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Clusters and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
Abstract

Recommendation systems are widely deployed in modern data centers. Random, sparse embedding lookups are the main performance bottleneck when processing recommendation systems on traditional platforms, because they induce massive data movement between compute units and memory. ReRAM-based processing-in-memory (PIM) can resolve this problem by processing embedding vectors where they are stored. However, the embedding table can easily exceed the capacity of a monolithic ReRAM-based PIM chip, inducing off-chip accesses that may offset the benefits of PIM. Tensor-train (TT) decomposition can compress the embedding table to fit on-chip at the cost of extra decompression computation; we therefore deploy the decomposed model on-chip and leverage the high computing efficiency of ReRAM to compensate for the decompression performance loss. In this paper, we propose ARCHER, a ReRAM-based PIM architecture that implements fully on-chip recommendation under resource constraints. First, we fully analyze the computation and access patterns of the decomposed table. Based on the computation pattern, we unify the operations of every layer of the decomposed model as multiply-and-accumulate (MAC) operations. Based on the access pattern, we propose a hierarchical mapping schema and a specialized hardware design that maximize resource utilization. Under the unified computation and mapping strategy, we coordinate the inter-processing-element pipeline. The evaluation shows that ARCHER outperforms the state-of-the-art GPU-based DLRM system, the state-of-the-art near-memory processing recommendation system RecNMP, and the ReRAM-based recommendation accelerator REREC by 15.79×, 2.21×, and 1.21× in terms of performance and 56.06×, 6.45×, and 1.71× in terms of energy savings, respectively.

Keywords: recommendation system; ReRAM; processing-in-memory; embedding layer
Corresponding Author(s): Long ZHENG   
Just Accepted Date: 09 October 2023   Issue Date: 15 December 2023
 Cite this article:   
Xinyang SHEN, Xiaofei LIAO, Long ZHENG, et al. ARCHER: a ReRAM-based accelerator for compressed recommendation systems[J]. Front. Comput. Sci., 2024, 18(5): 185607.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-3397-x
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I5/185607
Fig.1  Chip-scale comparison of representative ReRAM accelerators. The blue triangles and green lozenges mark the scale of the analog and digital parts, respectively, of the ReRAM devices in each accelerator
Fig.2  Overview of (a) the DLRM model. (b) Example of embedding table T0 compressed (decomposed) into 3 TT cores. (c) An example of a decompression (composition) process where a 4×32 core row r1 in TT core 1 multiplies a 32×4×32 core row r2 in TT core 2. The intermediate result IR is a 32×4×4 tensor, which multiplies a 4×32 core row in TT core 3 to produce a 4×4×4 tensor R. Eventually, R is reshaped into a 1×64 vector OR, the original row of the embedding table
Fig.3  Access patterns of the original tables and the TT cores. (a) Original-table access frequency: "Original row" is the row index in the original table, and "Frequency" is the access frequency of a range of rows. (b) TT-core access frequency: "TT core" is the core index d, and "Core row" is the row index within the core
Fig.4  Overview of ARCHER: chip, tile, and ARU
Fig.5  The workflow of ARCHER
Fig.6  Model mapping of ARCHER. (a) Three original tables T1, T2, and T3: a sample query on the target row of T1 and the corresponding processing of core rows cr1, cr2, and cr3 on the compressed T1 (C-T1) composed of TT cores 1, 2, and 3. (b) An example of layer-wise model mapping and index-based core mapping for C-T1, C-T2, and C-T3. (c) Mapping process of cr1, cr2, and cr3 on the ReRAM hardware
Fig.7  An example of ARCHER's pipelined execution of 3 batches. Five main pipeline stages are depicted: first-stage composition (FSC), bottom MLP (B-MLP), second-stage composition (SSC), interaction, and top MLP (T-MLP)
CPU: Intel Xeon CPU E5-2680 v4, 28 cores, 2.4 GHz
  Cache: L1 64 KB, L2 256 KB, L3 35 MB
  Main memory: 256 GB DDR4
GPU: Tesla P100, 56 SMs × 64 cores, 1.33 GHz
  Cache: L1 64 KB/SM, L2 4 MB
  Main memory: 16 GB HBM2
RecNMP: 4 channels × 1 DIMM-NMP × 2 Rank-NMPs
  Memory: 64 GB DRAM
REREC: 419 MB MLP/memory arrays, 1048 KB inner-product arrays
Tab.1  Hardware configurations of the CPU, GPU, RecNMP, and REREC
Model    Emb. size                                MLP size   Total
DCKD     Original, 2.16 GB                        1.15 MB    2.16 GB
DCSD     Original, 39.31 MB                       1.15 MB    40.46 MB
deDCKD   T1 cores 0.30 MB; T2 cores 10.28 MB;     1.15 MB    12.43 MB
         T3 cores 0.70 MB
Tab.2  Model properties
Model       RMC0   RMC1   RMC2   RMC3   RMC4   RMC5   RMC6
Emb.        1×     0.5×   2×     4×     1×     1×     1×
MLP         1×     1×     1×     1×     0.5×   2×     4×
Size (GB)   2.1    1.1    4.1    8.1    2.1    2.1    2.1
Tab.3  Configurable model parameters
Component     Param.       Spec.      Power (mW)   Area (mm²)
ARU properties (8 ARUs per tile)
ADC           Number       32         64           0.00384
              Resolution   8 b
DAC           Number       32 × 64    8            0.00034
              Resolution   2 b
S&H           Number       32 × 64    0.020        0.000080
Crossbar      Number       32         6.2          0.0005
              Size         64 × 64
              Bits/cell    2
S&A           Number       16         0.80         0.00096
IR            Size         4 KB       2.32         0.0038
OR            Size         512 B      0.42         0.0014
Tile properties (64 PEs per chip)
ARU (total)   Number       8          653.92       0.36384
Memory        ReRAM size   2 MB       35.078       0.3440
              eDRAM size   64 KB      20.7         0.083
Sigmoid       Number       2          0.52         0.0006
Chip properties
Tile          Total                   45.454 K     50.652
NoC           Flit size    128 b      75           0.58
Chip          Total                   45.529 K     51.232
Tab.4  Hardware configurations of ARCHER
Fig.8  Performance of ARCHER with different embedding layers. 'OoM' indicates that the model's memory footprint exceeds the GPU memory
Fig.9  Performance of ARCHER with different MLP layers
Fig.10  Energy saving of ARCHER with different embedding layers
Fig.11  Energy of ARCHER with different MLP layers
Fig.12  Performance of ARCHER against REREC
Fig.13  Energy of ARCHER against REREC
Fig.14  Performance of ARCHER with different mapping schemas
Fig.15  Performance of ARCHER with different ranks
Fig.16  Chip area breakdown of ARCHER
1 Ke L, Gupta U, Cho B Y, Brooks D, Chandra V, Diril U, Firoozshahian A, Hazelwood K, Jia B, Lee H H S, Li M, Maher B, Mudigere D, Naumov M, Schatz M, Smelyanskiy M, Wang X, Reagen B, Wu C J, Hempstead M, Zhang X. RecNMP: Accelerating personalized recommendation with near-memory processing. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 790−803
2 Naumov M, Mudigere D, Shi H J M, Huang J, Sundaraman N, Park J, Wang X, Gupta U, Wu C J, Azzolini A G, Dzhulgakov D, Mallevich A, Cherniavskii I, Lu Y, Krishnamoorthi R, Yu A, Kondratenko V, Pereira S, Chen X, Chen W, Rao V, Jia B, Xiong L, Smelyanskiy M. Deep learning recommendation model for personalization and recommendation systems. 2019, arXiv preprint arXiv: 1906.00091
3 Gupta U, Wu C J, Wang X, Naumov M, Reagen B, Brooks D, Cottel B, Hazelwood K, Hempstead M, Jia B, Lee H H S, Malevich A, Mudigere D, Smelyanskiy M, Xiong L, Zhang X. The architectural implications of Facebook’s DNN-based personalized recommendation. In: Proceedings of 2020 IEEE International Symposium on High Performance Computer Architecture. 2020, 488−501
4 Wu J, He X, Wang X, Wang Q, Chen W, Lian J, Xie X. Graph convolution machine for context-aware recommender system. Frontiers of Computer Science, 2022, 16(6): 166614
5 Guo H, Tang R, Ye Y, Li Z, He X, Dong Z. DeepFM: an end-to-end wide & deep learning framework for CTR prediction. 2018, arXiv preprint arXiv: 1804.04950
6 Zhou G, Mou N, Fan Y, Pi Q, Bian W, Zhou C, Zhu X, Gai K. Deep interest evolution network for click-through rate prediction. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence. 2019, 5941−5948
7 Hwang R, Kim T, Kwon Y, Rhu M. Centaur: a chiplet-based, hybrid sparse-dense accelerator for personalized recommendations. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 968−981
8 Kal H, Lee S, Ko G, Ro W W. SPACE: locality-aware processing in heterogeneous memory for personalized recommendations. In: Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture. 2021, 679−691
9 Shafiee A, Nag A, Muralimanohar N, Balasubramonian R, Strachan J P, Hu M, Williams R S, Srikumar V. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News, 2016, 44(3): 14–26
10 Chi P, Li S, Xu C, Zhang T, Zhao J, Liu Y, Wang Y, Xie Y. PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. ACM SIGARCH Computer Architecture News, 2016, 44(3): 27–39
11 Imani M, Gupta S, Kim Y, Rosing T. FloatPIM: in-memory acceleration of deep neural network training with high precision. In: Proceedings of the 46th ACM/IEEE Annual International Symposium on Computer Architecture. 2019, 802−815
12 Song L, Zhuo Y, Qian X, Li H, Chen Y. GraphR: accelerating graph processing using ReRAM. In: Proceedings of 2018 IEEE International Symposium on High Performance Computer Architecture. 2018, 531−543
13 Huang Y, Zheng L, Yao P, Zhao J, Liao X, Jin H, Xue J. A heterogeneous PIM hardware-software co-design for energy-efficient graph processing. In: Proceedings of 2020 IEEE International Parallel and Distributed Processing Symposium. 2020, 684−695
14 Zheng L, Zhao J, Huang Y, Wang Q, Zeng Z, Xue J, Liao X, Jin H. Spara: an energy-efficient ReRAM-based accelerator for sparse graph analytics applications. In: Proceedings of 2020 IEEE International Parallel and Distributed Processing Symposium. 2020, 696−707
15 Arka A I, Doppa J R, Pande P P, Joardar B K, Chakrabarty K. ReGraphX: NoC-enabled 3D heterogeneous ReRAM architecture for training graph neural networks. In: Proceedings of 2021 Design, Automation & Test in Europe Conference & Exhibition. 2021, 1667−1672
16 Zha Y, Li J. Hyper-AP: enhancing associative processing through a full-stack optimization. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 846−859
17 Imani M, Pampana S, Gupta S, Zhou M, Kim Y, Rosing T. DUAL: acceleration of clustering algorithms using digital-based processing in-memory. In: Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture. 2020, 356−371
18 Niu D, Xu C, Muralimanohar N, Jouppi N P, Xie Y. Design of cross-point metal-oxide ReRAM emphasizing reliability and cost. In: Proceedings of 2013 IEEE/ACM International Conference on Computer-Aided Design. 2013, 17−23
19 Wong H S P, Lee H Y, Yu S, Chen Y S, Wu Y, Chen P S, Lee B, Chen F T, Tsai M J. Metal−oxide RRAM. Proceedings of the IEEE, 2012, 100(6): 1951–1970
20 Li H, Jin H, Zheng L, Huang Y, Liao X. ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory. Frontiers of Computer Science, 2023, 17(2): 172103
21 Yin C, Acun B, Wu C J, Liu X. TT-Rec: Tensor train compression for deep learning recommendation models. 2021, arXiv preprint arXiv: 2101.11714
22 Hu M, Strachan J P, Li Z, Grafals E M, Davila N, Graves C, Lam S, Ge N, Yang J J, Williams R S. Dot-product engine for neuromorphic computing: programming 1T1M crossbar to accelerate matrix-vector multiplication. In: Proceedings of the 53rd ACM/EDAC/IEEE Design Automation Conference. 2016, 1−6
23 Xu C, Niu D, Muralimanohar N, Balasubramonian R, Zhang T, Yu S, Xie Y. Overcoming the challenges of crossbar resistive memory architectures. In: Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture. 2015, 476−488
24 Song L, Qian X, Li H, Chen Y. PipeLayer: a pipelined ReRAM-based accelerator for deep learning. In: Proceedings of 2017 IEEE International Symposium on High Performance Computer Architecture. 2017, 541−552
25 Cai H, Liu B, Chen J, Naviner L, Zhou Y, Wang Z, Yang J. A survey of in-spin transfer torque MRAM computing. Science China Information Sciences, 2021, 64(6): 160402
26 Luo Y, Wang P, Peng X, Sun X, Yu S. Benchmark of ferroelectric transistor-based hybrid precision synapse for neural network accelerator. IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, 2019, 5(2): 142–150
27 Xia F, Jiang D J, Xiong J, Sun N H. A survey of phase change memory systems. Journal of Computer Science and Technology, 2015, 30(1): 121–144
28 Gong N. Multi level cell (MLC) in 3D crosspoint phase change memory array. Science China Information Sciences, 2021, 64(6): 166401
29 Weinberger K, Dasgupta A, Langford J, Smola A, Attenberg J. Feature hashing for large scale multitask learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, 1113−1120
30 Guan H, Malevich A, Yang J, Park J, Yuen H. Post-training 4-bit quantization on embedding tables. 2019, arXiv preprint arXiv: 1911.02079
31 Oseledets I V. Tensor-train decomposition. SIAM Journal on Scientific Computing, 2011, 33(5): 2295–2317
32 Han T, Wang P, Niu S, Li C. Modality matches modality: pretraining modality-disentangled item representations for recommendation. In: Proceedings of the ACM Web Conference 2022. 2022, 2058−2066
33 Long Y, She X, Mukhopadhyay S. Design of reliable DNN accelerator with un-reliable ReRAM. In: Proceedings of 2019 Design, Automation & Test in Europe Conference & Exhibition. 2019, 1769−1774
34 Dong X, Xu C, Xie Y, Jouppi N P. NVSim: a circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2012, 31(7): 994–1007
35 Wang Y, Zhu Z, Chen F, Ma M, Dai G, Wang Y, Li H, Chen Y. REREC: in-ReRAM acceleration with access-aware mapping for personalized recommendation. In: Proceedings of 2021 IEEE/ACM International Conference on Computer Aided Design. 2021, 1−9
36 Muralimanohar N, Balasubramonian R, Jouppi N. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. 2007, 3−14
37 Jiang N, Becker D U, Michelogiannakis G, Balfour J, Towles B, Shaw D E, Kim J, Dally W J. A detailed and flexible cycle-accurate network-on-chip simulator. In: Proceedings of 2013 IEEE International Symposium on Performance Analysis of Systems and Software. 2013, 86−96
38 Huang Y, Zheng L, Yao P, Wang Q, Liao X, Jin H, Xue J. Accelerating graph convolutional networks using crossbar-based processing-in-memory architectures. In: Proceedings of 2022 IEEE International Symposium on High-Performance Computer Architecture. 2022, 1029−1042
39 Qu Y, Cai H, Ren K, Zhang W, Yu Y, Wen Y, Wang J. Product-based neural networks for user response prediction. In: Proceedings of the 16th IEEE International Conference on Data Mining. 2016, 1149−1154
40 Qu Y, Fang B, Zhang W, Tang R, Niu M, Guo H, Yu Y, He X. Product-based neural networks for user response prediction over multi-field categorical data. ACM Transactions on Information Systems, 2019, 37(1): 5
41 Ko H, Lee S, Park Y, Choi A. A survey of recommendation systems: recommendation models, techniques, and application fields. Electronics, 2022, 11(1): 141
42 Chen D, Jin H, Zheng L, Huang Y, Yao P, Gui C, Wang Q, Liu H, He H, Liao X, Zheng R. A general offloading approach for near-DRAM processing-in-memory architectures. In: Proceedings of 2022 IEEE International Parallel and Distributed Processing Symposium. 2022, 246−257
43 Chen D, He H, Jin H, Zheng L, Huang Y, Shen X, Liao X. MetaNMP: leveraging Cartesian-like product to accelerate HGNNs with near-memory processing. In: Proceedings of the 50th Annual International Symposium on Computer Architecture. 2023, 56
44 Kwon Y, Lee Y, Rhu M. Tensor casting: co-designing algorithm-architecture for personalized recommendation training. In: Proceedings of 2021 IEEE International Symposium on High-Performance Computer Architecture. 2021, 235−248
45 Wilkening M, Gupta U, Hsia S, Trippel C, Wu C J, Brooks D, Wei G Y. RecSSD: near data processing for solid state drive based recommendation inference. In: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2021, 717−729