Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2022, Vol. 16 Issue (3) : 163102    https://doi.org/10.1007/s11704-020-0169-8
RESEARCH ARTICLE
Accelerating the cryo-EM structure determination in RELION on GPU cluster
Xin YOU1, Hailong YANG1,2(), Zhongzhi LUAN1, Depei QIAN1
1. Sino-German Joint Software Institute, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
2. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, China
Abstract

Cryo-electron microscopy (cryo-EM) is one of the most powerful technologies available today for structural biology. RELION (Regularized Likelihood Optimization) implements a Bayesian algorithm for cryo-EM structure determination and is one of the most widely used software packages in this field. Many researchers have devoted effort to improving the performance of RELION to keep pace with the ever-increasing volume of datasets. In this paper, we focus on the performance analysis of the most time-consuming computation steps in RELION and identify their performance bottlenecks for targeted optimization. We propose several optimization strategies to improve the overall performance of RELION, including optimization of the expectation step, parallelization of the maximization step, acceleration of the symmetry computation, and memory affinity optimization. The experimental results show that our proposed optimizations achieve significant speedups of RELION across representative datasets. In addition, we perform roofline model analysis to understand the effectiveness of our optimizations.

Keywords cryo-EM structure determination      performance optimization      GPU acceleration      RELION     
Corresponding Author(s): Hailong YANG   
Just Accepted Date: 13 November 2020   Issue Date: 18 October 2021
 Cite this article:   
Xin YOU, Hailong YANG, Zhongzhi LUAN, et al. Accelerating the cryo-EM structure determination in RELION on GPU cluster[J]. Front. Comput. Sci., 2022, 16(3): 163102.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-020-0169-8
https://academic.hep.com.cn/fcs/EN/Y2022/V16/I3/163102
Fig.1  The parallel execution model of (a) E-Step [14] in RELION class2D, class3D and refine3D, (b) M-Step in RELION class2D and class3D and (c) M-Step in RELION refine3D
Slave Rank   Total Wait Time (s)   Ratio within E-step (%)
1            1.692                 0.26
2            2.135                 0.33
3            20.953                3.20
4            15.773                2.41
5            15.123                2.31
6            14.884                2.27
7            1.927                 0.29
8            15.029                2.29
Tab.1  Waiting time of slave processes within E-step
              Platform 1             Platform 2              Platform 3             KNL
CPU           Intel Xeon Gold 6148   Intel Xeon E5-2680 v4   Intel Xeon Gold 6148   Intel Xeon Phi 7210
CPU Memory    256 GB                 128 GB                  768 GB                 208 GB
GPU Number    8                      8                       16                     0
GPU           Nvidia V100 NVLink     Nvidia V100 PCIE        Nvidia V100 NVLink     —
GPU Memory    32 GB                  32 GB                   32 GB                  —
Interconnect  Infiniband FDR         ?                       ?                      ?
OS            CentOS 7.6             CentOS 7.4              ?                      ?
Node Number   2                      4                       1                      1
Software      icc v2018.5.274 (except Platform 3), gcc v4.8.5, openmpi 4.0.0, cuda v10.0 (except KNL), Intel VTune v2018.4.0
Tab.2  The configurations of the experiment platforms
Fig.2  The execution time distribution of (a) class2D, (b) class3D, (c) refine3D, and break down of execution time among each subroutine of (d) M-Step, (e) E-Step when running RELION with different datasets
Dataset   Type                         Data Size
data      class2D, class3D, refine3D   8.1 GB
hfn       class3D                      9.6 GB
virus     refine3D                     6.4 GB
WBP       refine3D                     137 GB
Tab.3  The details of the experiment datasets
Fig.3  
Fig.4  
Fig.5  The multi-level grouping method in M-Step (a) original grouping, (b) optimized CPU grouping, and (c) optimized CPU-GPU grouping. In this example, we have four classes and divide the slaves into four groups
Fig.6  Menon algorithm optimized with intra-group parallelization. (a) The master process of the group transposes the input (from (x,y,z) to (z,x,y)) and then equally distributes the input along the y-axis to each slave process in the same group. Each slave process (b) applies a 1D inverse FFT along the z axis and transposes back to (x,y,z), (c) equally divides the data along the z axis and exchanges it with the other processes in the same group, (d) applies an x-y plane 2D inverse FFT, multiplies with pre-calculated kernels, performs reduction, and applies an x-y plane forward FFT, (e) exchanges the data with each process again and transposes, (f) applies a 1D forward FFT along the z axis, and (g) merges the data back to the master process
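The partitioning in Fig.6 relies on the separability of the multidimensional FFT: a 3D inverse transform can be computed as a 1D inverse FFT along z (step (b)) followed by 2D inverse FFTs over the x-y planes (step (d)), which is what lets each slave process work on an independent slab. A minimal serial sketch of this decomposition (illustrative only, not the RELION implementation):

```python
import numpy as np

def separable_ifft3(vol):
    """3D inverse FFT computed in two stages, mirroring steps (b) and (d):
    a 1D inverse FFT along z, then 2D inverse FFTs over the x-y planes.
    In the parallel version, each stage is applied per-process to a slab."""
    # Stage 1: 1D inverse FFT along the z axis (axis 2 of an (x, y, z) array)
    stage1 = np.fft.ifft(vol, axis=2)
    # Stage 2: 2D inverse FFT over each x-y plane (axes 0 and 1)
    return np.fft.ifft2(stage1, axes=(0, 1))

vol = np.random.rand(8, 8, 8) + 1j * np.random.rand(8, 8, 8)
# The two-stage result matches the direct 3D inverse FFT
assert np.allclose(separable_ifft3(vol), np.fft.ifftn(vol))
```

Because the stages operate along disjoint axes, the exchange/transpose steps (c) and (e) only need to redistribute data between the z-contiguous and plane-contiguous layouts.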
Fig.7  
Fig.8  The FFT schedule partitioning and data buffering method on GPU. The upper part is the inverse 3D FFT schedule partitioning and data buffering of transposed input and the lower part is the forward 3D FFT schedule partitioning and data buffering of transposed output
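The schedule partitioning in Fig.8 bounds GPU memory use by transforming the volume in fixed-size batches rather than all at once. A serial numpy stand-in for this idea (the batch loop models staging chunks through a limited device buffer; batch size and shapes are illustrative):

```python
import numpy as np

def batched_ifft_z(vol, batch):
    """Apply the 1D inverse FFT along z in fixed-size batches of x-y columns.
    Only one batch of columns needs to reside in the (limited) buffer at a
    time, which is the essence of the schedule partitioning in Fig.8."""
    x, y, z = vol.shape
    cols = vol.reshape(x * y, z)           # one row per (x, y) column
    out = np.empty_like(cols)
    for start in range(0, cols.shape[0], batch):
        chunk = cols[start:start + batch]  # stage this chunk into the buffer
        out[start:start + batch] = np.fft.ifft(chunk, axis=1)
    return out.reshape(x, y, z)

vol = np.random.rand(6, 6, 8) + 0j
# Batched execution matches the unpartitioned transform
assert np.allclose(batched_ifft_z(vol, batch=5), np.fft.ifft(vol, axis=2))
```

On the GPU the same pattern applies with cuFFT batched plans; the partition size trades launch overhead against peak buffer footprint.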
Fig.9  
Fig.10  
Fig.11  The CPU-GPU data exchange (a) without CPU binding and (b) with CPU binding
Fig.12  The performance distribution of multiple runs of the original RELION (ori) and RELION with memory affinity binding (bind). From left to right, the workloads are data class2D, data class3D, data refine3D, hfn, virus, and WBP, respectively
Fig.13  The performance improvement of RELION with the proposed optimizations
Fig.14  (a) The execution time and performance speedup of getFourierTransform in each iteration before and after applying calculation redundancy elimination (with workload data refine3D). (b) The performance improvement with different numbers of processes on different datasets using MPS
Fig.15  The execution time and performance speedup of M-Step in each iteration with and without M-Step parallelization of workload (a) data refine3D, (b) virus, (c) data class3D, and (d) hfn
Fig.16  The GPU memory usage and performance of doGridding_iter on different input sizes before and after applying FFT schedule partitioning and data buffering
Fig.17  The execution time and performance speedup for each iteration with and without symmetry acceleration with workload (a) data refine3D, (b) virus, (c) data class3D, and (d) hfn. Note that the OpenMP parallelization is evaluated with 10 threads on CPU
Kernel               Flops                                    Bytes                          Operational Intensity
Select               N_a·(36 + N_R·42) + N_effect·11          N_a·(6 + N_R·15) + N_effect·3  [N_a·(36 + N_R·42) + N_effect·11] / [N_a·(6 + N_R·15) + N_effect·3]
Select:opt           16 + N_a·(22 + N_R·42) + N_effect·8      N_a·(6 + N_R·15) + N_effect·3  [16 + N_a·(22 + N_R·42) + N_effect·8] / [N_a·(6 + N_R·15) + N_effect·3]
doGridding           N_x·N_y·N_z·(2·log2(N_x·N_y·N_z) + 11)   N_x·N_y·N_z·7                  (2·log2(N_x·N_y·N_z) + 11) / 7
doGridding div&buff  N_x·N_y·N_z·(2·log2(N_x·N_y·N_z) + 11)   N_x·N_y·N_z·13                 (2·log2(N_x·N_y·N_z) + 11) / 13
Symmetry             N_x·N_y·N_z·98                           42·N_x·N_y·N_z + 32·N_syms     2.333
Symmetry:opt         N_x·N_y·N_z·98                           39·N_x·N_y·N_z + 9·N_syms      2.513
Tab.4  The flops, bytes, and operational intensity of each computation kernel in RELION
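The constant intensities in the Symmetry rows follow directly from the formulas: since N_syms is tiny relative to the voxel count, the ratio approaches 98/42 ≈ 2.333 before optimization and 98/39 ≈ 2.513 after. A quick check with illustrative sizes (the volume dimension and symmetry count below are assumptions, not values from the paper):

```python
# Operational intensity (flops / bytes) of the Symmetry kernel from Tab.4.
Nx = Ny = Nz = 256   # illustrative volume size
Nsyms = 60           # illustrative symmetry-operator count
vox = Nx * Ny * Nz

oi_sym = (vox * 98) / (42 * vox + 32 * Nsyms)      # original kernel
oi_sym_opt = (vox * 98) / (39 * vox + 9 * Nsyms)   # optimized kernel
print(round(oi_sym, 3), round(oi_sym_opt, 3))      # ≈ 2.333 and 2.513
```

The optimization leaves flops unchanged and only trims the bytes moved per voxel (42 → 39), which is why the intensity gain is modest but uniform across input sizes.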
Fig.18  The roofline model of RELION running on (a) Nvidia V100 GPU, (b) Intel Xeon Gold 6148, and (c) Intel Xeon Phi 7210 (KNL)
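Under the roofline model, the attainable performance of a kernel is the minimum of the machine's peak compute rate and its operational intensity times memory bandwidth. A sketch using nominal V100 figures (the peak and bandwidth below are approximate public specifications, not measurements from this paper):

```python
def roofline(oi, peak_gflops, bw_gbs):
    """Attainable GFLOP/s: bounded by peak compute or by bandwidth x intensity."""
    return min(peak_gflops, oi * bw_gbs)

# Nominal V100 figures (assumptions): ~14000 GFLOP/s FP32 peak, ~900 GB/s HBM2.
peak, bw = 14000.0, 900.0
for name, oi in [("Symmetry", 2.333), ("Symmetry:opt", 2.513)]:
    # Both intensities sit well left of the ridge point (peak/bw ≈ 15.6),
    # so the kernel is memory-bound and the bound scales with intensity.
    print(name, roofline(oi, peak, bw))
```

This is why the byte-reduction in the Symmetry optimization translates directly into a higher performance ceiling.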
Fig.19  The sensitivity analysis of blocking parameters (BLOCKSIZE, CFFTBLOCKSIZE) in M-Step. Each cell is the execution time of M-Step with model size of 400 and FFT size of 215
Fig.20  The execution time distribution and scalability with/without our optimizations on dataset (a) data (refine3D), (b) virus, (c) data (class3D), and (d) hfn. Each bar group shows the execution time with corresponding working slave numbers, where each bar indicates the time distribution of original (ori), CPU paralleled (par) and GPU paralleled (gpu) execution, from left to right respectively
Dataset         relion-opt (s)   relion-3.0_beta (s)   Speedup
data class2D    1064.61          1896                  1.78
data class3D    1533.18          3100                  2.02
data refine3D   1058.83          3152                  2.98
hfn             2802.39          6628.079              2.37
virus           1613.39          8850.032              5.49
Tab.5  The performance comparison of our optimized RELION implementation and the latest RELION-3 beta
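The speedup column is simply the baseline time divided by the optimized time; recomputing it from the two time columns reproduces Tab.5:

```python
# Speedup = baseline time / optimized time, per dataset (values from Tab.5).
results = {
    "data class2D": (1064.61, 1896.0),
    "data class3D": (1533.18, 3100.0),
    "data refine3D": (1058.83, 3152.0),
    "hfn": (2802.39, 6628.079),
    "virus": (1613.39, 8850.032),
}
for name, (opt, base) in results.items():
    print(name, round(base / opt, 2))  # 1.78, 2.02, 2.98, 2.37, 5.49
```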
Dataset   Energy ori (kJ)   Power ori (W)   Energy opt (kJ)   Power opt (W)
data      3168.02           410.14          1869.07           507.73
hfn       2667.61           328.08          1341.83           408.35
virus     2431.84           312.66          690.15            464.75
WBP       1266.73           285.88          1160.37           303.84
Tab.6  The energy and power consumption of original and optimized RELION when running with the datasets data, hfn, virus and WBP
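Average power and total energy together imply the runtime (t = E / P), which makes the trade-off in Tab.6 explicit: the optimized runs draw more power, but finish enough faster that total energy still drops. A quick derivation from the table values:

```python
# dataset: (energy_ori_kJ, power_ori_W, energy_opt_kJ, power_opt_W), from Tab.6
rows = {
    "data":  (3168.02, 410.14, 1869.07, 507.73),
    "hfn":   (2667.61, 328.08, 1341.83, 408.35),
    "virus": (2431.84, 312.66, 690.15, 464.75),
    "WBP":   (1266.73, 285.88, 1160.37, 303.84),
}
for name, (eo, po, ex, px) in rows.items():
    t_ori = eo * 1e3 / po   # implied runtime in seconds (E in J / P in W)
    t_opt = ex * 1e3 / px
    saving = 1 - ex / eo    # fractional energy saved by the optimizations
    print(f"{name}: {t_ori:.0f}s -> {t_opt:.0f}s, energy saved {saving:.0%}")
```

For the data workloads, for example, roughly 41% of the energy is saved despite a ~24% higher average power draw.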
Fig.21  The power consumption when running the workloads (class2D, class3D, refine3D) with dataset data. The orange curve is the power consumption of our optimized RELION, whereas the blue curve is the power consumption of original RELION compiled with ICC
1 Frank J, Shimkin B, Dowse H. SPIDER—a modular software system for electron image processing. Ultramicroscopy, 1981, 6(4): 343–357
2 Grigorieff N. FREALIGN: high-resolution refinement of single particle structures. Journal of Structural Biology, 2007, 157(1): 117–125
3 Tang G, Peng L, Baldwin P R, Mann D S, Jiang W, Rees I, Ludtke S J. EMAN2: an extensible image processing suite for electron microscopy. Journal of Structural Biology, 2007, 157(1): 38–46
4 Elmlund D, Elmlund H. SIMPLE: software for ab initio reconstruction of heterogeneous single-particles. Journal of Structural Biology, 2012, 180(3): 420–427
5 Scheres S H W. RELION: implementation of a Bayesian approach to cryo-EM structure determination. Journal of Structural Biology, 2012, 180(3): 519–530
6 Punjani A, Rubinstein J L, Fleet D J, Brubaker M A. cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination. Nature Methods, 2017, 14(3): 290
7 Hu M, Yu H, Gu K, Wang Z, Ruan H. A particle-filter framework for robust cryo-EM 3D reconstruction. Nature Methods, 2018, 15(12): 1083
8 Khoshouei M, Radjainia M, Baumeister W, Danev R. Cryo-EM structure of haemoglobin at 3.2 Å determined with the Volta phase plate. Nature Communications, 2017, 8: 16099
9 Paulino C, Kalienkova V, Lam A K M, Neldner Y, Dutzler R. Activation mechanism of the calcium-activated chloride channel TMEM16A revealed by cryo-EM. Nature, 2017, 552(7685): 421
10 Bai X, Yan C, Yang G, Lu P, Ma D. An atomic structure of human γ-secretase. Nature, 2015, 525(7568): 212
11 Fernandez-Leiro R, Scheres S H W. A pipeline approach to single-particle processing in RELION. Acta Crystallographica Section D: Structural Biology, 2017, 73(6): 496–502
12 Su H, Wen W, Du X, Lu X, Liao M, et al. GeRelion: GPU-enhanced parallel implementation of single particle cryo-EM image processing. bioRxiv, 2016, 075887
13 Kimanius D, Forsberg B O, Scheres S H W, Lindahl E. Accelerated cryo-EM structure determination with parallelisation using GPUs in RELION-2. eLife, 2016, 5: e18722
14 You X, Yang H, Luan Z, Qian D. Performance analysis and optimization of cryo-EM structure determination in RELION-2. In: Proceedings of Conference on Advanced Computer Architecture. 2018: 195–209
15 RELION version 2.1 stable, 2017
16 Li X, Grigorieff N, Cheng Y. GPU-enabled FREALIGN: accelerating single particle 3D reconstruction and refinement in Fourier space on graphics processors. Journal of Structural Biology, 2010, 172(3): 407–412
17 Wang K, Xu S, Yu H, Fu H, Yang G. GPU-based 3D cryo-EM reconstruction with key-value streams: poster. In: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. 2019: 421–422
18 Wang W, Duan B, Tang W, Zhang C, Tang G, Zhang P, Sun N. A coarse-grained stream architecture for cryo-electron microscopy images 3D reconstruction. In: Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 2012: 143–152
19 Zivanov J, Nakane T, Forsberg B O, Kimanius D, Hagen W J H, Lindahl E, Scheres S H W. New tools for automated high-resolution cryo-EM structure determination in RELION-3. eLife, 2018, 7: e42166
20 Pipe J G, Menon P. Sampling density compensation in MRI: rationale and an iterative numerical solution. Magnetic Resonance in Medicine, 1999, 41(1): 179–186
21 Reinders J. VTune™ Performance Analyzer Essentials: Measurement and Tuning Techniques for Software Developers. 1st ed. California: Intel Press, 2004
22 Wang E, Zhang Q, Shen B, Zhang G, Lu X, Wu Q, Wang Y. High-Performance Computing on the Intel® Xeon Phi™. 1st ed. New York: Springer, 2014: 167–188
23 NVIDIA. cuFFT library, 2010
24 Frigo M, Johnson S G. FFTW user's manual. Massachusetts Institute of Technology, 1999
25 Hursey J, Mallove E, Squyres J M, Lumsdaine A. An extensible framework for distributed testing of MPI implementations. In: Proceedings of Euro PVM/MPI. 2007
26 Williams S, Waterman A, Patterson D. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 2009, 52(4): 65–76
27 NVIDIA. NVIDIA Tesla V100 performance, 2019
28 Intel. Intel® Xeon® Gold 6148 processor, 2019
29 Sodani A. Knights Landing (KNL): 2nd generation Intel® Xeon Phi processor. In: Proceedings of 2015 IEEE Hot Chips 27 Symposium (HCS). 2015: 1–24
30 David H, Gorbatov E, Hanebutte U R, Khanna R, Le C. RAPL: memory power estimation and capping. In: Proceedings of 2010 ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED). 2010: 189–194