D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems

Front. Comput. Sci., 2023, Vol. 17, Issue 4: 174610

https://doi.org/10.1007/s11704-022-2160-z

RESEARCH ARTICLE


Jialun WANG, Wenhao PANG, Chuliang WENG, Aoying ZHOU

School of Data Science and Engineering, East China Normal University, Shanghai 200062, China


Abstract

In analytical queries, important operators such as JOIN and GROUP BY are well suited to parallelization, which makes the GPU an ideal accelerator given its parallel computing power. However, when the data size grows to hundreds of gigabytes, a single GPU card becomes insufficient because of its small global memory capacity and the slow data transfer between host and device. A straightforward solution is to add more GPUs linked by high-bandwidth interconnects, but this raises the cost substantially. We use the unified memory (UM) provided by NVIDIA CUDA (Compute Unified Device Architecture) to accelerate large-scale queries on just one GPU. We observe, however, that the transfer between host memory and UM, which happens before kernel execution, is often significantly slower than the theoretical bandwidth. An important reason is that, in a single-GPU environment, data processing systems usually invoke only one thread or a static number of threads for data copy, and the resulting inefficient transfer heavily degrades overall performance. In this paper, we present D-Cubicle, a runtime module that accelerates data transfer between host-managed memory and unified memory. D-Cubicle boosts the actual transfer speed dynamically through a self-adaptive approach. In our experiments, with data transfer taken into account, D-Cubicle processes 200 GB of data on a single GPU with 32 GB of global memory and achieves on average 1.43x and at most 2.09x the performance of the baseline system.
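
The abstract contrasts a fixed number of copy threads with a self-adaptive one. The sketch below is one possible reading of that idea, not the D-Cubicle algorithm itself: it copies a buffer block by block, splits each block across worker threads, and adjusts the thread count by simple hill climbing on the measured speed. The function names (`parallel_copy`, `next_thread_count`, `adaptive_copy`), the hill-climbing policy, and the block size are all assumptions made for illustration. Plain Python buffers stand in for host memory and unified memory, and because of CPython's global interpreter lock the code demonstrates the control logic rather than a real speedup.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def parallel_copy(src, dst, n_threads):
    """Copy src into dst using n_threads contiguous slices (hypothetical helper)."""
    size = len(src)
    if size == 0:
        return
    src_view, dst_view = memoryview(src), memoryview(dst)
    step = -(-size // n_threads)  # ceiling division: bytes per slice

    def copy_slice(start):
        end = min(start + step, size)
        dst_view[start:end] = src_view[start:end]

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(copy_slice, range(0, size, step)))


def next_thread_count(current, step, prev_speed, cur_speed, max_threads):
    """Assumed policy: keep the direction while speed improves, else reverse."""
    if cur_speed < prev_speed:
        step = -step
    nxt = current + step
    if nxt < 1 or nxt > max_threads:
        # Bounce off the boundary instead of getting stuck there.
        step = -step
        nxt = current + step
    return max(1, min(max_threads, nxt)), step


def adaptive_copy(src, dst, block_size, max_threads=8):
    """Copy src to dst block by block, retuning the thread count after each block."""
    threads, step, prev_speed = 1, 1, 0.0
    history = []  # thread count used for each block
    for off in range(0, len(src), block_size):
        end = min(off + block_size, len(src))
        t0 = time.perf_counter()
        parallel_copy(memoryview(src)[off:end], memoryview(dst)[off:end], threads)
        elapsed = time.perf_counter() - t0
        speed = (end - off) / max(elapsed, 1e-9)  # bytes per second
        history.append(threads)
        threads, step = next_thread_count(threads, step, prev_speed, speed, max_threads)
        prev_speed = speed
    return history


if __name__ == "__main__":
    data = bytes(range(256)) * 4096           # 1 MiB of sample data
    target = bytearray(len(data))
    used = adaptive_copy(data, target, block_size=64 * 1024, max_threads=4)
    print("threads per block:", used)
```

In a native implementation the destination would typically be a unified-memory allocation (for example one obtained with `cudaMallocManaged`) and the workers would be OS threads, so the measured speed would actually respond to the thread count. The per-block feedback loop is the part this sketch is meant to show.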

Keywords: data analytics, GPU, unified memory

Corresponding Author(s): Chuliang WENG

Just Accepted Date: 30 June 2022 Issue Date: 09 December 2022

Cite this article:

Jialun WANG, Wenhao PANG, Chuliang WENG, et al. D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems[J]. Front. Comput. Sci., 2023, 17(4): 174610.

URL:

https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-2160-z
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I4/174610

Fig.1 An overview of the discrete CPU-GPU architecture

Fig.2 Dataflow in a GPU-accelerated analytical system

Fig.3 Unified memory

Fig.4 D-Cubicle overview

Fig.5 Data transfer speeds of different sizes

Fig.6 Example of self-adaptive memory copy

Fig.7 Comparison of transfer times between CUDA and D-Cubicle. (a) Memory copy based on CUDA; (b) memory copy based on D-Cubicle

Fig.8 Performance evaluation of original OmniSciDB with SF20

Fig.9 Time ratio of OmniSciDB dealing with TPC-H queries (SF20)

Fig.10 Performance of OmniSciDB with/without D-Cubicle. (a) SF20; (b) SF100; (c) SF200

Fig.11 Performance of the self-adaptive strategy (SF200)

Fig.12 Comparison of the real-time transfer speeds of large data blocks

Fig.13 Performance evaluation of D-Cubicle in four-GPU environment with different data sizes. (a) SF20; (b) SF100; (c) SF200



