Please wait a minute...
Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2023, Vol. 17 Issue (2) : 172103    https://doi.org/10.1007/s11704-022-1322-3
RESEARCH ARTICLE
ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory
Huize LI, Hai JIN, Long ZHENG(), Yu HUANG, Xiaofei LIAO
National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Clusters and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
 Download: PDF(7128 KB)   HTML
 Export: BibTeX | EndNote | Reference Manager | ProCite | RefWorks
Abstract

With the increasing amount of data, there is an urgent need for efficient sorting algorithms to process large data sets. Hardware sorting algorithms have attracted much attention because they can take advantage of different hardware’s parallelism. But the traditional hardware sort accelerators suffer “memory wall” problems since their multiple rounds of data transmission between the memory and the processor. In this paper, we utilize the in-situ processing ability of the ReRAM crossbar to design a new ReCAM array that can process the matrix-vector multiplication operation and the vector-scalar comparison in the same array simultaneously. Using this designed ReCAM array, we present ReCSA, which is the first dedicated ReCAM-based sort accelerator. Besides hardware designs, we also develop algorithms to maximize memory utilization and minimize memory exchanges to improve sorting performance. The sorting algorithm in ReCSA can process various data types, such as integer, float, double, and strings.

We also present experiments to evaluate the performance and energy efficiency against the state-of-the-art sort accelerators. The experimental results show that ReCSA has 90.92×, 46.13×, 27.38×, 84.57×, and 3.36× speedups against CPU-, GPU-, FPGA-, NDP-, and PIM-based platforms when processing numeric data sets. ReCSA also has 24.82×, 32.94×, and 18.22× performance improvement when processing string data sets compared with CPU-, GPU-, and FPGA-based platforms.

Keywords ReCAM      parallel sorting      architecture design      processing-in-memory     
Corresponding Author(s): Long ZHENG   
Just Accepted Date: 15 December 2021   Issue Date: 02 August 2022
 Cite this article:   
Huize LI,Hai JIN,Long ZHENG, et al. ReCSA: a dedicated sort accelerator using ReRAM-based content addressable memory[J]. Front. Comput. Sci., 2023, 17(2): 172103.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-1322-3
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I2/172103
Fig.1  (a) A 2×2 ReRAM crossbar array for the matrix-vector multiplication (MVM) operation, and (b) A 2T2R ReCAM array
Fig.2  Time breakdown of conventional sorting accelerators running five real-world data sets (CPU-M means CPU’s memory access time and CPU-E means CPU’s execution time)
Fig.3  (a) Database data mapping in ReSQM (rows stored tuples and columns stored attributes), (b) ReCSA’s mapping strategy for float keys, and (c) string mapping strategy in ReCSA
Fig.4  Designed 1D1R ReCAM array to perform the MVM and vector-scalar operations
Fig.5  (a) Overview of the ReCSA architecture, (b) details of the ReCAM-based SE, (c) detailed structure of the register, (d) the architecture of the ReCAM MAT, and (e) the ReCAM array architecture used in this work
Fig.6  Filtering and counting algorithm. The input of this algorithm is Rx.V_cand and the output of this algorithm is the information stored in Rx.b[32] to Rx.b[1] as shown in lines 16 to 19
Fig.7  The algorithm of finding the minimum and the candidate minimum of a vector. The input of this algorithm is all sub-registers in REGs, and the output of this algorithm is the triples stored in Rx.b[0] as line 17 shows
Fig.8  Voting algorithm. The input of this algorithm is all triples stored in Rx.b[0] and there is no output
Fig.9  Numeric sorting algorithm. The input of this algorithm is the data set stored in the CAM array and the output is the sorted keys stored in the RM
Letter Digital L D L D L D
A 00001 B 00010 C 00011 D 00100
E 00101 F 00110 G 00111 H 01000
I 01001 J 01010 K 01011 L 01100
M 01101 N 01110 O 01111 P 10000
Q 10001 R 10010 S 10011 T 10100
U 10101 V 10110 W 10111 X 11000
Y 11001 Z 11010 EOS 00000 ? ?
Tab.1  Digital presentation of each letter in the English alphabet
Byte-key/B Size-key/GB Size-pair/GB Range θ of zipf
<5 2 4 a?z 0.75/1
<10 4 8 a?z 0.75/1
<15 8 16 a?z 0.75/1
<20 16 32 a?z 0.75/1
<30 32 64 a?z 0.75/1
Tab.2  Configurations of our string data sets
Dataset Details Size/MB
Random 106 random strings of length ≈ 100 97
Words 107 words with no duplicates 118
Url 107 strings containing urls 305
Genome 30 million strings of a, t, g, c 302
Filelist 10 million strings containing filepaths 656
Tab.3  Details of the real-world data sets used in our experiment
Components Size Area/mm2 Power
Array 4×256×256 0.0124 4.63 mW
TAG 4×256 1.49e–3 71.2 μW
Driver 4×256 5.52e–4 16.8 μW
ADC 1 0.004 2.96 mW
Mask Register 4 9.17e–5 5.1 μW
Total Group 1 0.0187 7.69 mW
Groups 8000 151.42 62.08 W
CTRL 1 0.17 43.6 mW
sALU 1 0.543 163.5 mW
REGs 1 0.118 73.2 mW
Total SE 1 152.64 62.46 W
Tab.4  Configurations of the LRLC ReCAM array and LRLC SE
Name Capacity/Gbits SE Area/mm2 Power/W Frequency/MHz
SRSC 0.5 107.28 57.84 748.6
SRLC 1 121.09 59.27 737.1
LRSC 1 124.37 59.92 734.9
LRLC 2 152.64 62.46 718.8
Tab.5  Characters of the four configurations of one SE
Fig.10  Four configurations of ReCSA running 16GB numeric key-only workloads in various θ
Fig.11  Energy consumption when perform numeric key-only workloads at Zipf θ is 0.75
Fig.12  Typical results of all platforms running various numeric and string workloads, and some results of ReSQM and NDP based platforms are missing since they can not support string data set. (a) Numeric key-only workloads when Zipf θ is 1; (b) numeric key-only workloads when Zipf θ is 0.75; (c) string key-only workloads when Zipf θ is 1; (d) string key-only workloads when Zipf θ is 0.75; (e) numeric key-value pair workloads when Zipf θ is 1; (f) string key-value pair workloads when Zipf θ is 1
Fig.13  Sorting rate of ReCSA and other platforms running five real-world data sets
Name Number of SEs Total area/mm2 Total power/kW Energy/J
SRSC 256 2.87e4 14.93 182.19
SRLC 128 1.94e4 7.78 185.43
LRSC 128 1.86e4 7.92 185.71
LRLC 64 9.77e3 4.45 187.38
Tab.6  The energy and area cost of four ReCSA configurations running 16GB 32-bit key-only input when Zipf θ = 0.75
  
  
  
  
  
1 D, Carlson L Carin . Continuing progress of spike sorting in the era of big data. Current Opinion in Neurobiology, 2019, 55: 90– 96
2 A, Kuritzin T, Kischka J, Schmitz G Churakov . Incomplete lineage sorting and hybridization statistics for large-scale retroposon insertion data. PLoS Computational Biology, 2016, 12( 3): e1004812
3 L S, Heath J P C Vergara . Sorting by short swaps. Current Opinion in Neurobiology, 2003, 10( 5): 775– 789
4 N, Tsuda T, Satoh T Kawada. A piepline sorting chip. In: Proceedings of IEEE International Solid-State Circuits Conference. 1987, 270– 271
5 J, Casper K Olukotun. Hardware acceleration of database operations. In: Proceedings of 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2014, 151– 160
6 R Cole . Parallel merge sort. SIAM Journal on Computing, 1988, 17( 4): 770– 785
7 R, Mueller J, Teubner G Alonso . Sorting networks on FPGAs. The VLDB Journal, 2012, 21( 1): 1– 23
8 J L, Bentley R Sedgewick. Fast algorithms for sorting and searching strings. In: Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms. 1997, 360– 369
9 K Ø, Arisland A C, Aasbø A Nundal . VLSI parallel shift sort algorithm and design. Integration, 1984, 2( 4): 331– 347
10 A, Farmahini-Farahani III H J, Duwe M J, Schulte K Compton . Modular design of high-throughput, low-latency sorting units. IEEE Transactions on Computers, 2013, 62( 7): 1389– 1402
11 N, Govindaraju J, Gray R, Kumar D Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data. 2006, 325– 336
12 D P, Singh I, Joshi J Choudhary . Survey of GPU based sorting algorithms. International Journal of Parallel Programming, 2018, 46( 6): 1017– 1034
13 N, Satish M, Harris M Garland. Designing efficient sorting algorithms for manycore GPUs. In: Proceedings of IEEE International Symposium on Parallel & Distributed Processing. 2009, 1– 10
14 N, Satish C, Kim J, Chhugani A D, Nguyen V W, Lee D, Kim P Dubey. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. 2010, 351– 362
15 D, Koch J Torresen. FPGASort: a high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting. In: Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 2011, 45– 54
16 M, Saitoh E A, Elsayed Chu T, Van S, Mashimo K Kise. A high-performance and cost-effective hardware merge sorter without feedback datapath. In: Proceedings of the 26th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. 2018, 197– 204
17 M, Cho D, Brand R, Bordawekar U, Finkler V, Kulandaisamy R Puri . PARADIS: an efficient parallel algorithm for in-place radix sort. Proceedings of the VLDB Endowment, 2015, 8( 2): 1518– 1529
18 M, Minutoli S K, Kuntz A, Tumeo P Kogge. Implementing radix sort on emu 1. In: Proceedings of the 3rd Workshop NearData Process. 2015
19 R, Balasubramonian J, Chang T, Manning J H, Moreno R, Murphy R, Nair S Swanson . Near-data processing: insights from a MICRO-46 workshop. IEEE Micro, 2014, 34( 4): 36– 42
20 Q, Zhu B, Akin H E, Sumbul F, Sadi J C, Hoe L, Pileggi F Franchetti. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In: Proceedings of IEEE International 3D Systems Integration Conference. 2013, 1– 7
21 H, Akinaga H Shima . Resistive random access memory (ReRAM) based on metal oxides. Proceedings of the IEEE, 2010, 98( 12): 2237– 2251
22 L, Yavits S, Kvatinsky A, Morad R Ginosar . Resistive associative processor. IEEE Computer Architecture Letters, 2015, 14( 2): 148– 151
23 M, Imani S, Gupta S, Sharma T S Rosing . NVQuery: efficient query processing in nonvolatile memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 38( 4): 628– 639
24 H, Li H, Jin L, Zheng X Liao . ReSQM: accelerating database operations using ReRAM-based content addressable memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39( 11): 4030– 4041
25 N, Leischner V, Osipov P Sanders. GPU sample sort. In: Proceedings of the IEEE International Symposium on Parallel & Distributed Processing. 2010, 1– 10
26 E, Sintorn U Assarsson . Fast parallel GPU-sorting using a hybrid algorithm. Journal of Parallel and Distributed Computing, 2008, 68( 10): 1381– 1388
27 D, Cederman P Tsigas . GPU-Quicksort: a practical quicksort algorithm for graphics processors. ACM Journal of Experimental Algorithmics, 2009, 14: 4
28 W, Song D, Koch M, Luján J Garside. Parallel hardware merge sorter. In: Proceedings of Annual International Symposium on Field-Programmable Custom Computing Machines. 2016, 95– 102
29 N, Samardzic W, Qiao V, Aggarwal M C F, Chang J Cong. Bonsai: high-performance adaptive merge tree sorting. In: Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture. 2020, 282– 294
30 P, Siegl R, Buchty M Berekovic. Data-centric computing frontiers: a survey on processing-in-memory. In: Proceedings of the 2nd International Symposium on Memory Systems. 2016, 295– 308
31 Z, Li N, Challapalle A K, Ramanathan V Narayanan. IMC-Sort: in-memory parallel sorting architecture using hybrid memory cube. In: Proceedings of the 30th Great Lakes Symposium on VLSI. 2020, 45– 50
32 A K, Prasad M, Rezaalipour M, Dehyadegari M N Bojnordi. Memristive data ranking. In: Proceedings of International Symposium on High-Performance Computer Architecture. 2021, 440– 452
33 H S P, Wong S, Raoux S, Kim J, Liang J P, Reifenberg B, Rajendran M, Asheghi K E Goodson . Phase change memory. Proceedings of the IEEE, 2010, 98( 12): 2201– 2227
34 K L, Wang J G, Alzate P K Amiri . Low-power non-volatile spintronic memory: STT-RAM and beyond. Journal of Physics D: Applied Physics, 2013, 46( 7): 074003
35 J, Li R K, Montoye M, Ishii L Chang . 1 Mb 0. 41 µm2 2T-2R cell nonvolatile TCAM with two-bit encoding and clocked self-referenced sensing. IEEE Journal of Solid-State Circuits , 2014, 49( 4): 896– 907
36 M F, Chang L Y, Huang W Z, Lin Y N, Chiang C C, Kuo C H, Chuang K H, Yang H J, Tsai T F, Chen S S Sheu . A ReRAM-based 4T2R nonvolatile TCAM using RC-filtered stress-decoupled scheme for frequent-OFF instant-ON search engines used in IoT and big-data processing. IEEE Journal of Solid-State Circuits, 2016, 51( 11): 2786– 2798
37 S, Matsunaga A, Katsumata M, Natsui S, Fukami T, Endoh H, Ohno T Hanyu. Fully parallel 6T-2MTJ nonvolatile TCAM with single-transistor-based self match-line discharge control. In: Proceedings of Symposium on VLSI Circuits - Digest of Technical Papers. 2011, 298– 299
38 L, Zhao Q, Deng Y, Zhang J Yang. RFAcc: a 3D ReRAM Associative array based random forest accelerator. In: Proceedings of the ACM International Conference on Supercomputing. 2019, 473– 483
39 W, Huangfu S, Li X, Hu Y Xie. RADAR: a 3D-ReRAM based DNA alignment accelerator architecture. In: Proceedings of the 55th ACM/ESDA/IEEE Design Automation Conference (DAC). 2018, 1– 6
40 B, Neelima B, Shamsundar A, Narayan R, Prabhu C Gomes . Kepler GPU accelerated recursive sorting using dynamic parallelism. Concurrency and Computation: Practice and Experience, 2017, 29( 4): e3865
41 M, Asiatici D, Maiorano P Ienne. FPGAs in the datacenters: the case of parallel hybrid super scalar string sample sort. In: Proceedings of the 31st IEEE International Conference on Application-specific Systems, Architectures and Processors. 2020, 133– 140
42 J T Pawlowski. Hybrid memory cube (HMC). In: Proceedings of IEEE Hot Chips 23 Symposium. 2011, 1– 24
43 J, Kim Y Kim. HBM: Memory solution for bandwidth-hungry processors. In: Proceedings of IEEE Hot Chips 26 Symposium. 2014, 1– 24
44 R, Sinha J, Zobel D Ring. Cache-efficient string sorting using copying. ACM Journal of Experimental Algorithmics, 2006, 11( 1.2): 1– 32
45 L, Yavits A, Morad R Ginosar . Computer architecture with associative processor replacing last-level cache and SIMD accelerator. IEEE Transactions on Computers, 2015, 64( 2): 368– 381
46 Y C Lien. A 4.5-mW 8-b 750-MS/s 2-b/step asynchronous subranged SAR ADC in 28-nm CMOS technology. In: Proceedings of Symposium on VLSI Circuits. 2012, 88– 89
47 D, Niu C, Xu N, Muralimanohar N P, Jouppi Y Xie. Design of cross-point metal-oxide ReRAM emphasizing reliability and cost. In: Proceedings of IEEE/ACM International Conference on Computer-Aided Design. 2013, 17– 23
48 E, Stehle H A Jacobsen. A memory bandwidth-efficient hybrid radix sort on GPUs. In: Proceedings of the 2017 ACM International Conference on Management of Data. 2017, 417– 432
49 H, David E, Gorbatov U R, Hanebutte R, Khanna C Le. RAPL: memory power estimation and capping. In: Proceedings of ACM/IEEE International Symposium on Low-Power Electronics and Design. 2010, 189– 194
50 A, Deshpande P J Narayanan. Can GPUs sort strings efficiently? In: Proceedings of the 20th Annual International Conference on High Performance Computing. 2013, 305– 313
[1] FCS-21322-OF-HL_suppl_1 Download
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed