Frontiers of Computer Science

Front. Comput. Sci.    2023, Vol. 17 Issue (4) : 174104    https://doi.org/10.1007/s11704-022-1749-6
RESEARCH ARTICLE
swSpAMM: optimizing large-scale sparse approximate matrix multiplication on Sunway Taihulight
Xiaoyan LIU1,2, Yi LIU2, Bohong YIN2, Hailong YANG1,2, Zhongzhi LUAN2, Depei QIAN2
1. State Key Laboratory of Software Development Environment, Beijing 100191, China
2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Abstract

Although matrix multiplication plays an essential role in a wide range of applications, previous works focus only on optimizing dense or sparse matrix multiplication. The Sparse Approximate Matrix Multiply (SpAMM) is an algorithm that accelerates the multiplication of decay matrices, whose sparsity lies between that of dense and sparse matrices. In addition, large-scale decay matrix multiplication is performed in scientific applications to solve cutting-edge problems. To optimize large-scale decay matrix multiplication using SpAMM on supercomputers such as Sunway Taihulight, we present swSpAMM, an optimized SpAMM algorithm that adapts the computation characteristics of the algorithm to the architecture features of Sunway Taihulight.

Specifically, we propose both intra-node and inter-node optimizations to accelerate swSpAMM for large-scale execution. For the intra-node optimizations, we explore algorithm parallelization and a block-major data layout that are tailored to better exploit the architecture advantages of the Sunway processor. For the inter-node optimizations, we propose a matrix organization strategy for better distributing sub-matrices across nodes and a dynamic scheduling strategy for improving load balance across nodes. We compare swSpAMM with an existing GEMM library on a single node as well as with large-scale matrix multiplication methods on multiple nodes. The experimental results show that swSpAMM achieves speedups of up to 14.5× over the xMath library on a single node and 2.2× over the 2D GEMM method on multiple nodes.
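The core of SpAMM is an occlusion test on hierarchical sub-blocks: a quadrant product is skipped whenever the product of the two quadrants' norms falls below a tolerance, which is what turns the decay of the matrix entries into saved work. The C sketch below is our own minimal illustration of that idea under the assumption of power-of-two block sizes; the names spamm_multiply, frob_norm, leaf_n, and tol are hypothetical and do not reproduce the authors' Sunway kernel.

```c
#include <math.h>
#include <stddef.h>

/* Frobenius norm of the n x n sub-block starting at (r, c) of an
 * N x N row-major matrix. */
double frob_norm(const double *M, size_t N,
                 size_t r, size_t c, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double v = M[(r + i) * N + (c + j)];
            s += v * v;
        }
    return sqrt(s);
}

/* C += A * B restricted to the given n x n sub-blocks, recursing over
 * quadrants and skipping products whose norm bound is below tol.
 * (ra, ca), (rb, cb), (rc, cc): block origins in A, B, C.
 * n is assumed to be leaf_n * 2^k. */
void spamm_multiply(const double *A, const double *B, double *C,
                    size_t N, size_t ra, size_t ca,
                    size_t rb, size_t cb, size_t rc, size_t cc,
                    size_t n, size_t leaf_n, double tol) {
    if (frob_norm(A, N, ra, ca, n) * frob_norm(B, N, rb, cb, n) < tol)
        return;                        /* contribution is negligible: skip */
    if (n <= leaf_n) {                 /* small enough: multiply directly  */
        for (size_t i = 0; i < n; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < n; j++)
                    C[(rc + i) * N + (cc + j)] +=
                        A[(ra + i) * N + (ca + k)] * B[(rb + k) * N + (cb + j)];
        return;
    }
    size_t h = n / 2;
    for (size_t i = 0; i < 2; i++)         /* quadrant row of C    */
        for (size_t j = 0; j < 2; j++)     /* quadrant column of C */
            for (size_t k = 0; k < 2; k++) /* inner quadrant index */
                spamm_multiply(A, B, C, N,
                               ra + i * h, ca + k * h,
                               rb + k * h, cb + j * h,
                               rc + i * h, cc + j * h,
                               h, leaf_n, tol);
}
```

A full product is computed by spamm_multiply(A, B, C, N, 0, 0, 0, 0, 0, 0, N, leaf_n, tol); a production implementation would cache the block norms in a quadtree and call a blocked GEMM at the leaves instead of recomputing norms on every descent.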

Keywords: approximate calculation, Sunway processor, performance optimization
Corresponding Author(s): Hailong YANG   
Just Accepted Date: 16 May 2022   Issue Date: 03 November 2022
 Cite this article:   
Xiaoyan LIU, Yi LIU, Bohong YIN, et al. swSpAMM: optimizing large-scale sparse approximate matrix multiplication on Sunway Taihulight[J]. Front. Comput. Sci., 2023, 17(4): 174104.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-1749-6
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I4/174104
Fig.1  The architecture of the Sunway processor. Left: the overall architecture. Right: the CPE mesh grid
  
                 2D      2.5D     3D      COSMA
Required nodes   Less    Medium   High    Less
Communications   More    Medium   Less    Depends
Flexibility      High    Medium   Less    High
Tab.1  The differences among GEMM frameworks
Fig.2  The imbalanced workload within a single node. A matrix multiplication shown in white can be skipped, while one shown in blue must actually be performed (cannot be skipped)
  
Fig.3  We suppose LN = 2 and the matrices are 4×4. The CPE loads a sub-block of A to perform the atomic task. (a) Store the matrix in row-major order; (b) store the matrix in column-major order; (c) store the matrix in block-major order. The numbers in the matrix are the address offsets from the first element of the array
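As a hedged illustration of why the block-major layout in Fig. 3(c) helps, the helpers below (our own sketch; the function names are hypothetical) map element (i, j) to its address offset under the three layouts. With block-major storage an entire LN×LN block is contiguous, so a CPE can fetch its atomic task with a single DMA transfer instead of LN strided reads.

```c
#include <stddef.h>

/* Offset of element (i, j) in an N x N matrix under three layouts.
 * LN is the block edge length; N is assumed to be a multiple of LN. */
size_t offset_row_major(size_t i, size_t j, size_t N) { return i * N + j; }
size_t offset_col_major(size_t i, size_t j, size_t N) { return j * N + i; }

/* Block-major: blocks are laid out in row-major order, and the elements
 * inside each LN x LN block are also row-major, so a whole block occupies
 * one contiguous range of memory. */
size_t offset_block_major(size_t i, size_t j, size_t N, size_t LN) {
    size_t bi = i / LN, bj = j / LN;      /* block coordinates       */
    size_t li = i % LN, lj = j % LN;      /* position inside a block */
    size_t blocks_per_row = N / LN;
    return (bi * blocks_per_row + bj) * LN * LN + li * LN + lj;
}
```

For N = 4 and LN = 2, element (3, 1) sits at offset 13 in row-major order but at offset 11 in block-major order, because it belongs to block (1, 0).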
Fig.4  swSpAMM reorders the loop iterations to eliminate strided SIMD loads. In this example, LN = 4 and the SIMD length is 4
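The caption does not spell out the exact reordering, so the sketch below shows one standard form of this transformation as an assumption: moving the k loop outward so that the innermost j loop walks B and C rows with unit stride, which lets a 4-wide SIMD unit issue contiguous loads instead of strided gathers.

```c
/* Naive i-j-k order: for fixed i and j, the access B[k*LN + j] over k is
 * strided by LN, which forces strided (gather-like) SIMD loads. */
void gemm_ijk(const double *A, const double *B, double *C, int LN) {
    for (int i = 0; i < LN; i++)
        for (int j = 0; j < LN; j++)
            for (int k = 0; k < LN; k++)
                C[i * LN + j] += A[i * LN + k] * B[k * LN + j];
}

/* Reordered i-k-j: for fixed i and k, the innermost loop touches
 * B[k*LN + j] and C[i*LN + j] contiguously, so a 4-wide SIMD unit
 * (matching the LN = 4 example) can use unit-stride loads and stores. */
void gemm_ikj(const double *A, const double *B, double *C, int LN) {
    for (int i = 0; i < LN; i++)
        for (int k = 0; k < LN; k++) {
            double a = A[i * LN + k];
            for (int j = 0; j < LN; j++)
                C[i * LN + j] += a * B[k * LN + j];
        }
}
```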
Fig.5  The two-level distributed data partition strategy of swSpAMM. At is the first-level blocking matrix, A is the second-level blocking matrix, and Ab is the buffer for communicating A. We suppose P = 4 in this figure
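Fig. 5 itself defines the two levels; the sketch below is only one plausible reading of how such a two-level partition can be indexed (a √P×√P process grid over first-level blocks, each cut into LN×LN second-level tiles), with hypothetical names and no claim to match swSpAMM's actual distribution.

```c
#include <stddef.h>

/* Which of the P processes owns first-level block (bi, bj), assuming a
 * grid_dim x grid_dim process grid in row-major order (P = 4, grid_dim = 2
 * in the Fig. 5 example). */
int owner_of_block(size_t bi, size_t bj, size_t grid_dim) {
    return (int)((bi % grid_dim) * grid_dim + (bj % grid_dim));
}

/* Second-level blocking: a node's local first-level block of edge Nt is
 * cut into (Nt / LN)^2 tiles of edge LN, each small enough to stream into
 * a CPE's scratchpad via DMA. */
size_t num_second_level_tiles(size_t Nt, size_t LN) {
    return (Nt / LN) * (Nt / LN);
}
```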
Fig.6  A heat map showing the number of actual matrix multiplications needed by each fine-grained block of C. In this example, A and B are 32768×32768 and the valid ratio is 25%
Fig.7  The master-worker schedule model of swSpAMM
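To make the master-worker model of Fig. 7 concrete, here is a minimal generic MPI sketch in which rank 0 hands out coarse-grained multiplication tasks on demand; the tags, the encoding of a task as an integer id, and run_task are our own assumptions rather than swSpAMM's actual protocol.

```c
#include <mpi.h>

#define TAG_TASK 1
#define TAG_DONE 2

/* Hypothetical: perform one coarse-grained block-multiplication task. */
extern void run_task(int task_id);

/* Master hands out task ids 0..num_tasks-1 on demand; a task id of -1
 * tells a worker to stop. Workers request work by sending TAG_DONE. */
void schedule(int num_tasks) {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                               /* master */
        int next = 0, active = size - 1, dummy;
        MPI_Status st;
        while (active > 0) {
            MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, &st);
            int task = (next < num_tasks) ? next++ : -1;
            if (task < 0) active--;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                     MPI_COMM_WORLD);
        }
    } else {                                       /* worker */
        int dummy = 0, task;
        for (;;) {
            MPI_Send(&dummy, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (task < 0) break;
            run_task(task);
        }
    }
}
```

Pull-based assignment of this kind keeps nodes busy even when the number of non-skippable sub-products per block of C varies strongly, which is exactly the imbalance visualized in Fig. 6.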
  
  
Implementation and valid ratio   N = 1024   N = 2048   N = 4096   N = 8192   N = 16384
swSpAMM/100%                     0.020844   0.159756   1.242582   9.902074   79.31162
swSpAMM/75%                      0.016782   0.120488   0.946127   7.544022   59.53522
swSpAMM/50%                      0.011284   0.08381    0.64348    5.069342   40.4408
swSpAMM/25%                      0.006586   0.045322   0.337227   2.643518   20.49286
swSpAMM/15%                      0.004584   0.030082   0.214267   1.642496   13.00652
swSpAMM/5%                       0.002572   0.014621   0.094555   0.688117   5.239222
xMath(GEMM)                      0.007699   0.059382   0.491574   3.94961    31.86116
Tab.2  The performance (in seconds) of swSpAMM and xMath for matrices with algebraic decay
Implementation and valid ratio   N = 1024   N = 2048   N = 4096   N = 8192   N = 16384
swSpAMM/25%                      0.006619   0.045794   0.34208    2.658852   21.00916
swSpAMM/15%                      0.004656   0.030391   0.218922   1.678066   13.17432
swSpAMM/5%                       0.002609   0.014772   0.096508   0.691492   5.316884
swSpAMM/1%                       0.001707   0.008075   0.045087   0.304195   2.196088
xMath(GEMM)                      0.007691   0.059359   0.491493   3.949868   31.85852
Tab.3  The performance (in seconds) of swSpAMM and xMath for matrices with exponential decay
Fig.8  The speedup of swSpAMM with algebraic decay
Fig.9  The speedup of swSpAMM with exponential decay
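As a worked check of the headline single-node result, the largest speedup in Tab. 3 comes from N = 16384 at a 1% valid ratio:

```latex
\mathrm{speedup} = \frac{t_{\mathrm{xMath}}}{t_{\mathrm{swSpAMM}}}
                 = \frac{31.85852\ \mathrm{s}}{2.196088\ \mathrm{s}} \approx 14.5\times
```

which matches the up-to-14.5× figure quoted in the abstract.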
Implementation                         N = 1024   N = 2048   N = 4096   N = 8192   N = 16384
Basic                                  0.031865   0.240756   1.89257    15.0027    119.829
+ Architecture-specific (opt1)         0.017049   0.125609   0.980016   7.70376    61.7126
+ Block-major data layout (opt2)       0.012372   0.093339   0.722906   5.69697    45.3532
+ Parallelization exploration (opt3)   0.011284   0.08381    0.64348    5.069342   40.4408
Tab.4  The performance (in seconds) of the implementations in ablation study
Fig.10  The performance results of the ablation study. opt1 means SIMD optimization, opt2 means block-major data layout, and opt3 means intra-node load balancing
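Reading Tab. 4 at N = 16384, the three optimizations shorten the runtime step by step, for an overall gain of roughly 3×:

```latex
119.829\ \mathrm{s} \xrightarrow{\ \text{opt1}\ } 61.7126\ \mathrm{s}
         \xrightarrow{\ \text{opt2}\ } 45.3532\ \mathrm{s}
         \xrightarrow{\ \text{opt3}\ } 40.4408\ \mathrm{s},
\qquad \frac{119.829}{40.4408} \approx 2.96\times
```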
Nodes N = 8192 N = 16384 N = 32768 N = 65536 N = 131072
4 1.9526 13.452
8 1.4546 8.9092
16 0.9374 5.8864 41.677
32 0.6553 3.5677 22.399
64 0.3568 2.0592 13.032 92.038
128 1.4531 8.1913 51.82
256 0.7818 4.3437 26.848 187.04
512 3.0037 16.937 107.91
Tab.5  The performance (in seconds) of 2D framework
Nodes N = 8192 N = 16384 N = 32768 N = 65536 N = 131072
4 1.9528 13.45
8 0.9466 6.6422
16 0.9454 5.8943 41.753
32 0.4529 2.898 20.762
64 0.2347 1.501 10.612 92.482
128 1.0257 6.4939 46.062
256 0.5126 3.2726 23.166 187.23
512 2.1272 13.311 92.964
Tab.6  The performance (in seconds) of 2.5D framework
Nodes N = 8192 N = 16384 N = 32768 N = 65536 N = 131072
8 0.9451 6.6416 49.961
64 0.2194 1.4348 10.361 77.326
512 0.0493 0.2637 1.6654 11.584 85.988
Tab.7  The performance (in seconds) of 3D framework
Nodes N = 8192 N = 16384 N = 32768 N = 65536 N = 131072
4 1.9879 15.016 119.33
8 1.0204 7.5894 62.057
16 0.5665 3.928 30.612 242.45
32 0.3705 1.8628 15.572 120.78
64 0.2006 1.0659 7.6402 62.472 494.92
128 0.7238 3.9433 29.186 249.29
256 0.377 1.979 14.613 113.13
512 4.6493 7.5753 57.539
Tab.8  The performance (in seconds) of swSpAMM
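As a worked check of the multi-node result, taking the 512-node entries 16.937 s (2D, Tab. 5) and 7.5753 s (swSpAMM, Tab. 8), which we read as the N = 65536 column, reproduces the headline figure:

```latex
\mathrm{speedup} = \frac{t_{\mathrm{2D}}}{t_{\mathrm{swSpAMM}}}
                 = \frac{16.937\ \mathrm{s}}{7.5753\ \mathrm{s}} \approx 2.2\times
```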
Fig.11  The performance comparison with matrix size N = 8192
Fig.12  The performance comparison with matrix size N = 16384
Fig.13  The performance comparison with matrix size N = 32768
Fig.14  The performance comparison with matrix size N = 65536
Fig.15  The performance comparison with matrix size N = 131072