Towards optimized tensor code generation for deep learning on Sunway many-core processor
Mingzhen LI1,2, Changxi LIU3, Jianjin LIAO1, Xuegui ZHENG1, Hailong YANG1,2, Rujun SUN4, Jun XU5, Lin GAN6, Guangwen YANG6, Zhongzhi LUAN1, Depei QIAN1

1. State Key Laboratory of Software Development Environment, Beijing 100191, China
2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
3. National University of Singapore, Singapore 119077, Singapore
4. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214000, China
5. Science and Technology on Special System Simulation Laboratory, Beijing Simulation Center, Beijing 100854, China
6. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Abstract The flourishing of deep learning frameworks and hardware platforms has created demand for an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor renders itself a competitive candidate for its attractive computational power in both scientific computing and deep learning workloads. This paper combines these two trends. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time compilation for architectures that require cross-compilation, such as Sunway. In addition, we leverage architectural features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfer, and local device memory for data locality, in order to generate efficient code for deep learning workloads on Sunway. The experimental results show that the code generated by swTVM achieves a 1.79× improvement in inference latency on average, compared to the state-of-the-art deep learning framework on Sunway, across eight representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and the Sunway processor, with both productivity and efficiency in mind. We believe this work will encourage more people to embrace the power of deep learning and the Sunway many-core processor.
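The abstract mentions staging data through each compute element's small local device memory (LDM) via DMA transfers. As a rough, hypothetical illustration of that tiling pattern only (the names `LDM_SIZE`, `dma_get`, and `dma_put` are stand-ins invented here, not the paper's generated code or the actual athread/DMA intrinsics on Sunway):

```python
# Illustrative sketch: compute over a large array while never holding more
# than one small LDM-sized tile at a time, mimicking DMA-in / compute / DMA-out.

LDM_SIZE = 8  # pretend each compute element's LDM holds only 8 elements


def dma_get(main_mem, start, length):
    """Simulate a DMA transfer from main memory into an LDM buffer."""
    return main_mem[start:start + length]


def dma_put(main_mem, start, ldm_buf):
    """Simulate a DMA transfer from an LDM buffer back to main memory."""
    main_mem[start:start + len(ldm_buf)] = ldm_buf


def tiled_scale(x, alpha):
    """Scale a vector tile by tile, one LDM-sized block per iteration."""
    out = [0.0] * len(x)
    for start in range(0, len(x), LDM_SIZE):
        tile = dma_get(x, start, min(LDM_SIZE, len(x) - start))  # DMA in
        tile = [alpha * v for v in tile]                         # compute in LDM
        dma_put(out, start, tile)                                # DMA out
    return out


print(tiled_scale(list(range(10)), 2.0))
```

In the real kernels described by the paper, the per-tile compute is a generated loop nest distributed across the 64 compute elements of a core group, and double buffering can overlap the DMA transfers with computation.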
Keywords
Sunway processor
deep learning compiler
code generation
performance optimization
Corresponding Author(s):
Hailong YANG
|
Just Accepted Date: 21 November 2022
Issue Date: 27 February 2023