Frontiers of Computer Science

Front. Comput. Sci.    2024, Vol. 18 Issue (2) : 182101    https://doi.org/10.1007/s11704-022-2440-7
Architecture
Towards optimized tensor code generation for deep learning on sunway many-core processor
Mingzhen LI1,2, Changxi LIU3, Jianjin LIAO1, Xuegui ZHENG1, Hailong YANG1,2(), Rujun SUN4, Jun XU5, Lin GAN6, Guangwen YANG6, Zhongzhi LUAN1, Depei QIAN1
1. State Key Laboratory of Software Development Environment, Beijing 100191, China
2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
3. National University of Singapore, Singapore 119077, Singapore
4. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214000, China
5. Science and Technology on Special System Simulation Laboratory, Beijing Simulation Center, Beijing 100854, China
6. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Abstract

The proliferation of deep learning frameworks and hardware platforms demands an efficient compiler that can hide the diversity of both software and hardware in order to provide application portability. Among existing deep learning compilers, TVM is well known for its efficient code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor is a competitive candidate because of its attractive computational power for both scientific computing and deep learning workloads. This paper combines these two trends. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time (AOT) compilation for architectures that require cross-compilation, such as Sunway. In addition, we exploit architectural features during compilation, including the core group for massive parallelism, DMA for high-bandwidth memory transfer, and local device memory (LDM) for data locality, to generate efficient code for deep learning workloads on Sunway. The experimental results show that the code generated by swTVM improves inference latency by 1.79× on average, compared to the state-of-the-art deep learning framework on Sunway, across eight representative benchmarks. To our knowledge, this is the first attempt from the compiler perspective to bridge the gap between deep learning and the Sunway processor with both productivity and efficiency in mind. We believe this work will encourage more people to embrace the combined power of deep learning and the Sunway many-core processor.

Keywords: Sunway processor; deep learning compiler; code generation; performance optimization
Corresponding Author(s): Hailong YANG   

Just Accepted Date: 21 November 2022   Issue Date: 27 February 2023
 Cite this article:   
Mingzhen LI, Changxi LIU, Jianjin LIAO, et al. Towards optimized tensor code generation for deep learning on sunway many-core processor[J]. Front. Comput. Sci., 2024, 18(2): 182101.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-2440-7
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I2/182101
Fig.1  (a) The design overview of swTVM; (b) the Sunway architecture; and (c) the automatic code generation of deep learning models on MPE and CPEs
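For context, each Sunway core group pairs one management processing element (MPE) with 64 computing processing elements (CPEs), and each CPE owns a 64 KB local device memory (LDM). The sketch below is a minimal, hypothetical illustration of the MPE/CPE division of labor shown in Fig.1(c), written against the Sunway athread library; it is not swTVM's actual output, and the exact athread declaration conventions vary across toolchain versions.

```c
#include <athread.h>

typedef struct { const float *a, *b; float *c; int n; } add_args_t;

/* Slave-side kernel, compiled separately for the CPEs; each of the 64
 * CPEs in a core group picks its slice of [0, n) from its own id. */
extern void slave_vec_add(add_args_t *arg);

/* MPE (host) side: launch the kernel on one core group. */
void vec_add_on_core_group(const float *a, const float *b, float *c, int n) {
    add_args_t args = { a, b, c, n };
    athread_init();                 /* bring up the CPE cluster           */
    athread_spawn(vec_add, &args);  /* fan the kernel out to the 64 CPEs  */
    athread_join();                 /* barrier: wait for all CPEs         */
    athread_halt();                 /* release the CPEs                   */
}
```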
Fig.2  AOT code generation on the Sunway processor
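Because the Sunway toolchain requires cross-compilation, swTVM emits code ahead of time instead of JIT-compiling on the target. As a rough sketch of what an AOT entry point could look like, the model becomes a fixed sequence of statically compiled layer functions; all names and the tensor_t type here are illustrative, not swTVM's real symbols.

```c
/* Hypothetical AOT output: one C function per layer, plus a model entry
 * point that wires them together with preallocated intermediates. */
typedef struct { float *data; int shape[4]; } tensor_t;

void conv0(const tensor_t *in, tensor_t *out);   /* generated layer fns */
void relu0(const tensor_t *in, tensor_t *out);
void dense0(const tensor_t *in, tensor_t *out);

void model_forward(const tensor_t *input, tensor_t *output,
                   tensor_t *workspace /* preallocated intermediates */) {
    conv0(input, &workspace[0]);
    relu0(&workspace[0], &workspace[1]);
    dense0(&workspace[1], output);
}
```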
Fig.3  An example of matrix multiplication implementation generated by swTVM with optimizations targeting Sunway
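To give a flavor of Fig.3, below is a minimal CPE-side sketch of a tiled matrix multiplication: tiles of A, B, and C are staged into LDM by DMA, accumulated locally, and written back. dma_get/dma_put are hypothetical stand-ins for the athread DMA primitives, and the 32×32×32 tile sizes are illustrative choices that fit comfortably in the 64 KB LDM.

```c
#include <stddef.h>

#define TM 32
#define TN 32
#define TK 32

void dma_get(void *ldm_dst, const void *mem_src, size_t bytes);  /* assumed */
void dma_put(void *mem_dst, const void *ldm_src, size_t bytes);  /* assumed */

/* Compute one TM x TN tile of C = A * B at origin (ti, tj).
 * A is M x K, B is K x N, C is M x N, all row-major. */
void matmul_tile_cpe(const float *A, const float *B, float *C,
                     int M, int N, int K, int ti, int tj) {
    /* LDM tiles: (TM*TK + TK*TN + TM*TN) * 4 B = 12 KB, under 64 KB.
     * Real slave code would mark these with the toolchain's LDM
     * attribute (e.g., __thread_local); plain statics stand in here. */
    static float a_buf[TM][TK], b_buf[TK][TN], c_buf[TM][TN];

    for (int i = 0; i < TM; i++)
        for (int j = 0; j < TN; j++)
            c_buf[i][j] = 0.0f;

    for (int tk = 0; tk < K; tk += TK) {
        for (int i = 0; i < TM; i++)   /* stage an A tile, row by row */
            dma_get(a_buf[i], &A[(ti + i) * K + tk], TK * sizeof(float));
        for (int k = 0; k < TK; k++)   /* stage a B tile, row by row  */
            dma_get(b_buf[k], &B[(tk + k) * N + tj], TN * sizeof(float));

        for (int i = 0; i < TM; i++)   /* accumulate on the LDM tiles */
            for (int k = 0; k < TK; k++)
                for (int j = 0; j < TN; j++)
                    c_buf[i][j] += a_buf[i][k] * b_buf[k][j];
    }

    for (int i = 0; i < TM; i++)       /* write the C tile back       */
        dma_put(&C[(ti + i) * N + tj], c_buf[i], TN * sizeof(float));
}
```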
Fig.4  Buffer size dependencies among matrices A, B, and C within matrix multiplication
Fig.5  Procedure for calculating the buffer sizes in matrix multiplication. (a) Buffer dimensions with buffer_read/buffer_write; (b) table of buffer iterators for each tensor; (c) the equation for the sum of buffer sizes over all buffer iterators; (d) possible values for each buffer iterator (in powers of two)
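A simplified sketch of the buffer-size search Fig.5 describes: choose power-of-two extents for the buffer iterators so that the summed LDM footprint of all staged tensors stays within the 64 KB budget. The real analysis derives each term from buffer_read/buffer_write dependencies; here the matmul footprint formula is hard-coded, and the objective (maximize LDM utilization) is a stand-in for swTVM's actual cost model.

```c
#include <stdio.h>

#define LDM_BYTES (64 * 1024)

int main(void) {
    int best_m = 0, best_n = 0, best_k = 0;
    long best_bytes = 0;

    /* Enumerate power-of-two candidates for the three buffer iterators. */
    for (int m = 1; m <= 256; m <<= 1)
        for (int n = 1; n <= 256; n <<= 1)
            for (int k = 1; k <= 256; k <<= 1) {
                /* A tile: m*k, B tile: k*n, C tile: m*n floats. */
                long bytes = 4L * (m * k + k * n + m * n);
                if (bytes <= LDM_BYTES && bytes > best_bytes) {
                    best_bytes = bytes;
                    best_m = m; best_n = n; best_k = k;
                }
            }

    printf("tile %d x %d x %d uses %ld of %d LDM bytes\n",
           best_m, best_n, best_k, best_bytes, LDM_BYTES);
    return 0;
}
```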
Fig.6  Illustration of the DMA auto-insertion algorithm. (a) The initial state of the iterators; (b) iterator i to be removed; (c) the DMA locations determined for the tensor
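To make the placement idea concrete, here is a hand-written sketch of the rule the figure illustrates: a tensor's DMA is hoisted out of every loop whose iterator does not appear in its index expression, and lands just inside the innermost loop that does. This is an illustration of the idea, not the paper's algorithm; dma_get/dma_put are the same hypothetical helpers as above, and K is assumed to be at most K_MAX so the B tile fits statically in LDM.

```c
#include <stddef.h>

#define TN    64
#define K_MAX 128   /* 128 * 64 * 4 B = 32 KB for the B tile */

void dma_get(void *ldm_dst, const void *mem_src, size_t bytes);  /* assumed */
void dma_put(void *mem_dst, const void *ldm_src, size_t bytes);  /* assumed */

/* Compute C[i][tj .. tj+TN) = sum_k A[i][k] * B[k][tj .. tj+TN). */
void matmul_col_tile(const float *A, const float *B, float *C,
                     int M, int N, int K, int tj) {
    static float b_buf[K_MAX][TN], a_buf[K_MAX], c_buf[TN];
    if (K > K_MAX) return;  /* sketch-only bound */

    /* B is indexed by (k, j) but not i: its load hoists above the i loop. */
    for (int k = 0; k < K; k++)
        dma_get(b_buf[k], &B[k * N + tj], TN * sizeof(float));

    for (int i = 0; i < M; i++) {
        /* A is indexed by (i, k): its load cannot move above the i loop. */
        dma_get(a_buf, &A[(size_t)i * K], K * sizeof(float));
        for (int j = 0; j < TN; j++) c_buf[j] = 0.0f;
        for (int k = 0; k < K; k++)
            for (int j = 0; j < TN; j++)
                c_buf[j] += a_buf[k] * b_buf[k][j];
        /* C is indexed by (i, j): its store sits at the bottom of i. */
        dma_put(&C[(size_t)i * N + tj], c_buf, TN * sizeof(float));
    }
}
```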
Model | Task | Batch size (bs) | Input size
ResNet18 | Image classification | 1, 2, 4, 8 | (bs, 3, 224, 224)
ResNet50 | Image classification | 1, 2, 4, 8 | (bs, 3, 224, 224)
VGG16 | Image classification | 1, 2, 4, 8 | (bs, 3, 224, 224)
YOLOv3 | Object detection | 1, 2, 4, 8 | (bs, 3, 416, 416)
DCGAN | Image generation | 1, 2, 4, 8 | (bs, 100, 1, 1)
MobileNet | Image classification | 1, 2, 4, 8 | (bs, 3, 224, 224)
ShuffleNet | Image classification | 1, 2, 4, 8 | (bs, 3, 224, 224)
BERT-base | Question answering | 1, 2, 4, 8 | (bs, seqlen=16)
Tab.1  Deep learning models in experiments
Fig.7  End-to-end performance of swTVM with two configurations of graph-level optimization, OPT=1 and OPT=4. The y-axis represents the speedup compared to swCaffe. (a) Batch size = 1; (b) batch size = 2; (c) batch size = 4; (d) batch size = 8
Fig.8  Performance of the convolution, dense, and memory-intensive layers of swTVM compared to swCaffe, when the batch size is set to 1
Fig.9  Roofline analysis. All benchmarks under the batch sizes of 1, 2, 4, and 8 are included
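For reference, the bound behind a roofline plot like Fig.9 is the standard one: attainable performance is capped by either the peak compute throughput or the memory bandwidth times the kernel's arithmetic intensity. The Sunway core group's concrete peak and bandwidth figures are not restated here.

```latex
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\; I \cdot B_{\text{mem}}\bigr),
\qquad
I = \frac{\text{floating-point operations}}{\text{bytes moved to/from memory}}
```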
Fig.10  Compilation overhead of swTVM on the Sunway processor, compared to that of TVM on an x86 CPU