Frontiers of Computer Science

Front. Comput. Sci.    2024, Vol. 18 Issue (5) : 185607    https://doi.org/10.1007/s11704-023-3397-x
RESEARCH ARTICLE
ARCHER: a ReRAM-based accelerator for compressed recommendation systems
Xinyang SHEN, Xiaofei LIAO, Long ZHENG, Yu HUANG, Dan CHEN, Hai JIN
National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Clusters and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
Abstract

Recommendation systems are widely deployed in modern data centers. Random, sparse embedding lookups are the main performance bottleneck when processing recommendation systems on traditional platforms, because they induce massive data movement between compute units and memory. ReRAM-based processing-in-memory (PIM) can resolve this problem by processing embedding vectors where they are stored. However, the embedding table can easily exceed the capacity of a monolithic ReRAM-based PIM chip, inducing off-chip accesses that may offset the benefits of PIM. Tensor-train (TT) decomposition can compress the embedding table to fit on-chip at the cost of extra decompression computation; we therefore deploy the decomposed model on-chip and leverage the high computing efficiency of ReRAM to compensate for the decompression performance loss. In this paper, we propose ARCHER, a ReRAM-based PIM architecture that implements fully on-chip recommendation under resource constraints. First, we fully analyze the computation and access patterns of the decomposed table. Based on the computation pattern, we unify the operations of every layer of the decomposed model as multiply-and-accumulate (MAC) operations. Based on the access pattern, we propose a hierarchical mapping schema and a specialized hardware design that maximize resource utilization. Under the unified computation and mapping strategy, we coordinate the inter-processing-element pipeline. The evaluation shows that ARCHER outperforms the state-of-the-art GPU-based DLRM system, the state-of-the-art near-memory processing recommendation system RecNMP, and the ReRAM-based recommendation accelerator REREC by 15.79×, 2.21×, and 1.21× in terms of performance and 56.06×, 6.45×, and 1.71× in terms of energy savings, respectively.

Keywords: recommendation system; ReRAM; processing-in-memory; embedding layer
Corresponding Author(s): Long ZHENG   
Just Accepted Date: 09 October 2023   Issue Date: 15 December 2023
 Cite this article:   
Xinyang SHEN, Xiaofei LIAO, Long ZHENG, et al. ARCHER: a ReRAM-based accelerator for compressed recommendation systems[J]. Front. Comput. Sci., 2024, 18(5): 185607.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-3397-x
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I5/185607
Fig.1  Chip-scale comparison of representative ReRAM accelerators. The blue triangles and green lozenges mark the scale of the analog and digital parts, respectively, of the ReRAM devices in each accelerator
Fig.2  Overview of (a) the DLRM model. (b) Example of embedding table T0 compressed (decomposed) into 3 TT cores. (c) An example of a decompression (composition) process where a 4×32 core row r1 in TT core 1 multiplies a 32×4×32 core row r2 in TT core 2. The intermediate result IR is a 32×4×4 tensor, which multiplies a 4×32 core row in TT core 3 to produce a 4×4×4 tensor R. Eventually, R is reshaped into a 1×64 vector OR, the original row of the embedding table
Fig.3  Access patterns of the original tables and the TT cores. (a) Original-table access frequency: "Original row" is the row index in the original table, and "Frequency" is the access frequency of a range of rows. (b) TT-core access frequency: "TT core" is the core index d, and "Core row" is the row index within the core
Fig.4  Overview of ARCHER: chip, tile, and ARU
Fig.5  The workflow of ARCHER
Fig.6  Model mapping of ARCHER. (a) Three original tables T1, T2, and T3: a sample query on the target row of T1 and the corresponding processing of core rows cr1, cr2, and cr3 on the compressed T1 (C-T1) composed of TT cores 1, 2, and 3. (b) An example of layer-wise model mapping and index-based core mapping for C-T1, C-T2, and C-T3. (c) Mapping process of cr1, cr2, and cr3 on the ReRAM hardware
Fig.7  An example of ARCHER's pipelined execution of 3 batches. Five main pipeline stages are depicted: first-stage composition (FSC), bottom MLP (B-MLP), second-stage composition (SSC), interaction, and top MLP (T-MLP)
CPU: Intel Xeon CPU E5-2680 v4, 28 cores, 2.4 GHz
  Cache: L1 64 KB, L2 256 KB, L3 35 MB
  Main memory: 256 GB DDR4
GPU: Tesla P100, 56 SMs × 64 cores, 1.33 GHz
  Cache: L1 64 KB/SM, L2 4 MB
  Main memory: 16 GB HBM2
RecNMP: 4 channels × 1 DIMM-NMP × 2 Rank-NMPs
  Memory: 64 GB DRAM
REREC: 419 MB MLP/memory arrays, 1048 KB inner-product arrays
Tab.1  Hardware configurations of the CPU, GPU, RecNMP, and REREC
Model    Emb. size                                MLP size   Total
DCKD     Original, 2.16 GB                        1.15 MB    2.16 GB
DCSD     Original, 39.31 MB                       1.15 MB    40.46 MB
deDCKD   T1 cores 0.30 MB; T2 cores 10.28 MB;     1.15 MB    12.43 MB
         T3 cores 0.70 MB
Tab.2  Model properties
Model       RMC0   RMC1   RMC2   RMC3   RMC4   RMC5   RMC6
Emb.        1×     0.5×   2×     4×     1×     1×     1×
MLP         1×     1×     1×     1×     0.5×   2×     4×
Size (GB)   2.1    1.1    4.1    8.1    2.1    2.1    2.1
Tab.3  Configurable model parameters
Component     Param.       Spec.      Power (mW)   Area (mm²)
ARU properties (8 ARUs per tile)
ADC           Number       32         64           0.00384
              Resolution   8 b
DAC           Number       32 × 64    8            0.00034
              Resolution   2 b
S&H           Number       32 × 64    0.020        0.000080
Crossbar      Number       32         6.2          0.0005
              Size         64 × 64
              Bits/cell    2
S&A           Number       16         0.80         0.00096
IR            Size         4 KB       2.32         0.0038
OR            Size         512 B      0.42         0.0014
Tile properties (64 PEs per chip)
ARU (total)   Number       8          653.92       0.36384
Memory        ReRAM size   2 MB       35.078       0.3440
              eDRAM size   64 KB      20.7         0.083
Sigmoid       Number       2          0.52         0.0006
Chip properties
Tile          Total                   45.454 K     50.652
NoC           Flit size    128 b      75           0.58
Chip          Total                   45.529 K     51.232
Tab.4  Hardware configurations of ARCHER
Fig.8  Performance of ARCHER with different embedding layers. 'OoM' indicates that the model's memory footprint exceeds the GPU memory
Fig.9  Performance of ARCHER with different MLP layers
Fig.10  Energy saving of ARCHER with different embedding layers
Fig.11  Energy of ARCHER with different MLP layers
Fig.12  Performance of ARCHER against REREC
Fig.13  Energy of ARCHER against REREC
Fig.14  Performance of ARCHER with different mapping schemas
Fig.15  Performance of ARCHER with different ranks
Fig.16  Chip area breakdown of ARCHER
1 Ke L, Gupta U, Cho B Y, Brooks D, Chandra V, Diril U, Firoozshahian A, Hazelwood K, Jia B, Lee H H S, Li M, Maher B, Mudigere D, Naumov M, Schatz M, Smelyanskiy M, Wang X, Reagen B, Wu C J, Hempstead M, Zhang X. RecNMP: Accelerating personalized recommendation with near-memory processing. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 790−803
2 Naumov M, Mudigere D, Shi H J M, Huang J, Sundaraman N, Park J, Wang X, Gupta U, Wu C J, Azzolini A G, Dzhulgakov D, Mallevich A, Cherniavskii I, Lu Y, Krishnamoorthi R, Yu A, Kondratenko V, Pereira S, Chen X, Chen W, Rao V, Jia B, Xiong L, Smelyanskiy M. Deep learning recommendation model for personalization and recommendation systems. 2019, arXiv preprint arXiv: 1906.00091
3 Gupta U, Wu C J, Wang X, Naumov M, Reagen B, Brooks D, Cottel B, Hazelwood K, Hempstead M, Jia B, Lee H H S, Malevich A, Mudigere D, Smelyanskiy M, Xiong L, Zhang X. The architectural implications of Facebook’s DNN-based personalized recommendation. In: Proceedings of 2020 IEEE International Symposium on High Performance Computer Architecture. 2020, 488−501
4 Wu J, He X, Wang X, Wang Q, Chen W, Lian J, Xie X. Graph convolution machine for context-aware recommender system. Frontiers of Computer Science, 2022, 16(6): 166614
5 Guo H, Tang R, Ye Y, Li Z, He X, Dong Z. DeepFM: an end-to-end wide & deep learning framework for CTR prediction. 2018, arXiv preprint arXiv: 1804.04950
6 Zhou G, Mou N, Fan Y, Pi Q, Bian W, Zhou C, Zhu X, Gai K. Deep interest evolution network for click-through rate prediction. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence. 2019, 5941−5948
7 Hwang R, Kim T, Kwon Y, Rhu M. Centaur: a chiplet-based, hybrid sparse-dense accelerator for personalized recommendations. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 968−981
8 Kal H, Lee S, Ko G, Ro W W. SPACE: locality-aware processing in heterogeneous memory for personalized recommendations. In: Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture. 2021, 679−691
9 Shafiee A, Nag A, Muralimanohar N, Balasubramonian R, Strachan J P, Hu M, Williams R S, Srikumar V. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. ACM SIGARCH Computer Architecture News, 2016, 44(3): 14–26
10 Chi P, Li S, Xu C, Zhang T, Zhao J, Liu Y, Wang Y, Xie Y. PRIME: a novel processing-in-memory architecture for neural network computation in ReRAM-based main memory. ACM SIGARCH Computer Architecture News, 2016, 44(3): 27–39
11 Imani M, Gupta S, Kim Y, Rosing T. FloatPIM: in-memory acceleration of deep neural network training with high precision. In: Proceedings of the 46th ACM/IEEE Annual International Symposium on Computer Architecture. 2019, 802−815
12 Song L, Zhuo Y, Qian X, Li H, Chen Y. GraphR: accelerating graph processing using ReRAM. In: Proceedings of 2018 IEEE International Symposium on High Performance Computer Architecture. 2018, 531−543
13 Huang Y, Zheng L, Yao P, Zhao J, Liao X, Jin H, Xue J. A heterogeneous PIM hardware-software co-design for energy-efficient graph processing. In: Proceedings of 2020 IEEE International Parallel and Distributed Processing Symposium. 2020, 684−695
14 Zheng L, Zhao J, Huang Y, Wang Q, Zeng Z, Xue J, Liao X, Jin H. Spara: an energy-efficient ReRAM-based accelerator for sparse graph analytics applications. In: Proceedings of 2020 IEEE International Parallel and Distributed Processing Symposium. 2020, 696−707
15 Arka A I, Doppa J R, Pande P P, Joardar B K, Chakrabarty K. ReGraphX: NoC-enabled 3D heterogeneous ReRAM architecture for training graph neural networks. In: Proceedings of 2021 Design, Automation & Test in Europe Conference & Exhibition. 2021, 1667−1672
16 Zha Y, Li J. Hyper-AP: enhancing associative processing through a full-stack optimization. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 846−859
17 Imani M, Pampana S, Gupta S, Zhou M, Kim Y, Rosing T. DUAL: acceleration of clustering algorithms using digital-based processing in-memory. In: Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture. 2020, 356−371
18 Niu D, Xu C, Muralimanohar N, Jouppi N P, Xie Y. Design of cross-point metal-oxide ReRAM emphasizing reliability and cost. In: Proceedings of 2013 IEEE/ACM International Conference on Computer-Aided Design. 2013, 17−23
19 Wong H S P, Lee H Y, Yu S, Chen Y S, Wu Y, Chen P S, Lee B, Chen F T, Tsai M J. Metal−oxide RRAM. Proceedings of the IEEE, 2012, 100(6): 1951–1970
20 Li H, Jin H, Zheng L, Huang Y, Liao X. ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory. Frontiers of Computer Science, 2023, 17(2): 172103
21 Yin C, Acun B, Wu C J, Liu X. TT-Rec: Tensor train compression for deep learning recommendation models. 2021, arXiv preprint arXiv: 2101.11714
22 Hu M, Strachan J P, Li Z, Grafals E M, Davila N, Graves C, Lam S, Ge N, Yang J J, Williams R S. Dot-product engine for neuromorphic computing: programming 1T1M crossbar to accelerate matrix-vector multiplication. In: Proceedings of the 53rd ACM/EDAC/IEEE Design Automation Conference. 2016, 1−6
23 Xu C, Niu D, Muralimanohar N, Balasubramonian R, Zhang T, Yu S, Xie Y. Overcoming the challenges of crossbar resistive memory architectures. In: Proceedings of the 21st IEEE International Symposium on High Performance Computer Architecture. 2015, 476−488
24 Song L, Qian X, Li H, Chen Y. PipeLayer: a pipelined ReRAM-based accelerator for deep learning. In: Proceedings of 2017 IEEE International Symposium on High Performance Computer Architecture. 2017, 541−552
25 Cai H, Liu B, Chen J, Naviner L, Zhou Y, Wang Z, Yang J. A survey of in-spin transfer torque MRAM computing. Science China Information Sciences, 2021, 64(6): 160402
26 Luo Y, Wang P, Peng X, Sun X, Yu S. Benchmark of ferroelectric transistor-based hybrid precision synapse for neural network accelerator. IEEE Journal on Exploratory Solid-State Computational Devices and Circuits, 2019, 5(2): 142–150
27 Xia F, Jiang D J, Xiong J, Sun N H. A survey of phase change memory systems. Journal of Computer Science and Technology, 2015, 30(1): 121–144
28 Gong N. Multi level cell (MLC) in 3D crosspoint phase change memory array. Science China Information Sciences, 2021, 64(6): 166401
29 Weinberger K, Dasgupta A, Langford J, Smola A, Attenberg J. Feature hashing for large scale multitask learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. 2009, 1113−1120
30 Guan H, Malevich A, Yang J, Park J, Yuen H. Post-training 4-bit quantization on embedding tables. 2019, arXiv preprint arXiv: 1911.02079
31 Oseledets I V. Tensor-train decomposition. SIAM Journal on Scientific Computing, 2011, 33(5): 2295–2317
32 Han T, Wang P, Niu S, Li C. Modality matches modality: pretraining modality-disentangled item representations for recommendation. In: Proceedings of the ACM Web Conference 2022. 2022, 2058−2066
33 Long Y, She X, Mukhopadhyay S. Design of reliable DNN accelerator with un-reliable ReRAM. In: Proceedings of 2019 Design, Automation & Test in Europe Conference & Exhibition. 2019, 1769−1774
34 Dong X, Xu C, Xie Y, Jouppi N P. NVSim: a circuit-level performance, energy, and area model for emerging nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2012, 31(7): 994–1007
35 Wang Y, Zhu Z, Chen F, Ma M, Dai G, Wang Y, Li H, Chen Y. REREC: in-ReRAM acceleration with access-aware mapping for personalized recommendation. In: Proceedings of 2021 IEEE/ACM International Conference on Computer Aided Design. 2021, 1−9
36 Muralimanohar N, Balasubramonian R, Jouppi N. Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0. In: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture. 2007, 3−14
37 Jiang N, Becker D U, Michelogiannakis G, Balfour J, Towles B, Shaw D E, Kim J, Dally W J. A detailed and flexible cycle-accurate network-on-chip simulator. In: Proceedings of 2013 IEEE International Symposium on Performance Analysis of Systems and Software. 2013, 86−96
38 Huang Y, Zheng L, Yao P, Wang Q, Liao X, Jin H, Xue J. Accelerating graph convolutional networks using crossbar-based processing-in-memory architectures. In: Proceedings of 2022 IEEE International Symposium on High-Performance Computer Architecture. 2022, 1029−1042
39 Qu Y, Cai H, Ren K, Zhang W, Yu Y, Wen Y, Wang J. Product-based neural networks for user response prediction. In: Proceedings of the 16th IEEE International Conference on Data Mining. 2016, 1149−1154
40 Qu Y, Fang B, Zhang W, Tang R, Niu M, Guo H, Yu Y, He X. Product-based neural networks for user response prediction over multi-field categorical data. ACM Transactions on Information Systems, 2019, 37(1): 5
41 Ko H, Lee S, Park Y, Choi A. A survey of recommendation systems: recommendation models, techniques, and application fields. Electronics, 2022, 11(1): 141
42 Chen D, Jin H, Zheng L, Huang Y, Yao P, Gui C, Wang Q, Liu H, He H, Liao X, Zheng R. A general offloading approach for near-DRAM processing-in-memory architectures. In: Proceedings of 2022 IEEE International Parallel and Distributed Processing Symposium. 2022, 246−257
43 Chen D, He H, Jin H, Zheng L, Huang Y, Shen X, Liao X. MetaNMP: leveraging Cartesian-like product to accelerate HGNNs with near-memory processing. In: Proceedings of the 50th Annual International Symposium on Computer Architecture. 2023, 56
44 Kwon Y, Lee Y, Rhu M. Tensor casting: co-designing algorithm-architecture for personalized recommendation training. In: Proceedings of 2021 IEEE International Symposium on High-Performance Computer Architecture. 2021, 235−248
45 Wilkening M, Gupta U, Hsia S, Trippel C, Wu C J, Brooks D, Wei G Y. RecSSD: near data processing for solid state drive based recommendation inference. In: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 2021, 717−729