Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP


Front. Comput. Sci.    2023, Vol. 17 Issue (2) : 172311    https://doi.org/10.1007/s11704-021-1018-0
RESEARCH ARTICLE
Accelerating local SGD for non-IID data using variance reduction
Xianfeng LIANG1, Shuheng SHEN2, Enhong CHEN1, Jinchang LIU3, Qi LIU1, Yifei CHENG1, Zhen PAN1
1. Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China, Hefei 230027, China
2. Ant Financial Services Group, Hangzhou 310000, China
3. Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong 999077, China
Abstract

Distributed stochastic gradient descent and its variants have been widely adopted for training machine learning models in parallel across multiple workers. Among them, local-based algorithms, including Local SGD and FedAvg, have gained much attention due to their appealing properties, such as low communication cost and privacy preservation. Nevertheless, when the data distributions on workers are non-identical, local-based algorithms suffer a significant degradation in convergence rate. In this paper, we propose Variance Reduced Local SGD (VRL-SGD) to deal with heterogeneous data. Without extra communication cost, VRL-SGD reduces the gradient variance among workers caused by heterogeneous data, and thus it prevents the slow convergence that local-based algorithms otherwise exhibit. Moreover, we present VRL-SGD-W, which adds an effective warm-up mechanism for scenarios where the data across workers are highly diverse. By eliminating the impact of such heterogeneous data, we theoretically prove that VRL-SGD achieves a linear iteration speedup with lower communication complexity even if workers access non-identical datasets. We conduct experiments on three machine learning tasks. The experimental results demonstrate that VRL-SGD performs considerably better than Local SGD on heterogeneous data and that VRL-SGD-W is much more robust under high data variance among workers.
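For intuition, the sketch below (our own illustration, not code from the paper) shows how a local-SGD round with a worker-side variance-reduction correction can be organized: each worker takes k corrected local steps, the models are averaged, and each worker then refreshes a control-variate term from its drift relative to the new average. The function name local_vr_sgd_round, the per-parameter correction tensors, and this particular control-variate update are assumptions for illustration and need not match the exact VRL-SGD update rule.

import torch

def local_vr_sgd_round(workers, global_model, lr, k):
    # One communication round. `workers` is a list of (model, data_loader, correction)
    # triples; `correction` is a list of tensors (initialized to zeros) with the same
    # shapes as the model parameters, acting as a per-worker control variate.
    for model, loader, correction in workers:
        model.load_state_dict(global_model.state_dict())      # start from the averaged model
        for _, (x, y) in zip(range(k), loader):                # k local steps
            loss = torch.nn.functional.cross_entropy(model(x), y)
            model.zero_grad()
            loss.backward()
            with torch.no_grad():
                for p, c in zip(model.parameters(), correction):
                    p -= lr * (p.grad - c)                     # variance-corrected local step
    with torch.no_grad():
        # Average the local models into the new global model.
        for name, p in global_model.named_parameters():
            p.copy_(sum(m.state_dict()[name] for m, _, _ in workers) / len(workers))
        # Refresh each worker's control variate from the gap between the new global
        # model and its own end-of-round model (one common control-variate choice,
        # not necessarily the paper's exact rule).
        for model, _, correction in workers:
            for c, p_local, p_global in zip(correction, model.parameters(),
                                            global_model.parameters()):
                c += (p_global - p_local) / (lr * k)
    return global_model

In this sketch, workers only exchange model parameters at the end of the round, so the correction adds no extra communication, which is the property the abstract emphasizes.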

Keywords: distributed optimization; variance reduction; local SGD; federated learning; non-IID data
Corresponding Author(s): Enhong CHEN   
Just Accepted Date: 11 June 2021   Issue Date: 17 March 2022
 Cite this article:   
Xianfeng LIANG, Shuheng SHEN, Enhong CHEN, et al. Accelerating local SGD for non-IID data using variance reduction[J]. Front. Comput. Sci., 2023, 17(2): 172311.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-021-1018-0
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I2/172311
Fig.1  Procedure of Local SGD with N = 2 workers and k = 2 local updates in the non-identical case. The local models x_t^i (green) move towards their local optima x_*^i (yellow) and away from the global optimum x^* (red)
Reference     Identical data        Non-identical data    Extra assumptions
SGD [20]      T                     T                     No
PR-SGD [13]   O(N^{3/4} T^{3/4})    O(N^{3/4} T^{3/4})    (1)
CoCoD [14]    O(N^{3/2} T^{1/2})    O(N^{3/4} T^{3/4})    (2)
VRL-SGD       O(N^{3/2} T^{1/2})    O(N^{3/2} T^{1/2})    No
Tab.1  Comparison of the communication complexity of different algorithms. The second and third columns show the communication complexity for identical and non-identical datasets, respectively. The following are regarded as extra assumptions: (1) an upper bound on the gradients; (2) bounded gradient variance among workers
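One way to read the O(·) entries (our back-of-the-envelope interpretation, not a derivation stated on this page): with T total iterations and communication period k, a local-update method performs T/k communication rounds, so the largest admissible period determines each column:

\[
  \frac{T}{k} = O\!\bigl(N^{3/2}T^{1/2}\bigr)
  \quad \text{when } k = \Theta\!\bigl(T^{1/2}/N^{3/2}\bigr),
  \qquad
  \frac{T}{k} = O\!\bigl(N^{3/4}T^{3/4}\bigr)
  \quad \text{when } k = \Theta\!\bigl(T^{1/4}/N^{3/4}\bigr).
\]

Under this reading, the VRL-SGD row says the linear-speedup rate is retained with a communication period as large as Θ(T^{1/2}/N^{3/2}) even on non-identical data, whereas fully synchronous SGD communicates in every one of its T iterations.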
Model              N   b    γ      k    Dataset         n        m
LeNet              8   32   0.005  20   MNIST           60,000   10
TextCNN            8   64   0.01   50   DBPedia         560,000  14
Transfer Learning  8   32   0.025  20   Tiny ImageNet   100,000  200
Tab.2  Parameters used in the experiments and a summary of the datasets. N denotes the number of workers, b the batch size on each worker, γ the learning rate, k the communication period, n the number of data samples, and m the number of data categories
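The non-identical setting in such experiments is commonly simulated by partitioning each dataset by label, so that every one of the N workers observes only a few of the m categories. The snippet below is a hypothetical sketch of such a label-sorted split; the function name and the even-shard assignment are our own illustration rather than the paper's exact partitioning scheme.

import numpy as np

def label_sorted_partition(labels, num_workers):
    # Sort sample indices by label, then cut the sorted order into contiguous
    # shards so each worker receives only a narrow range of classes (non-IID).
    order = np.argsort(labels)
    shards = np.array_split(order, num_workers)
    return [shard.tolist() for shard in shards]

# Hypothetical usage with MNIST-scale labels (60,000 samples, 10 classes, N = 8):
labels = np.random.randint(0, 10, size=60_000)    # placeholder labels for illustration
parts = label_sorted_partition(labels, num_workers=8)
print([len(p) for p in parts])                    # each worker gets roughly 7,500 samples

In the actual experiments the per-worker class counts and the exact assignment may differ; the point of the sketch is only that contiguous label-sorted shards make the local data distributions non-identical.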
Fig.3  Epoch loss for the non-identical case. VRL-SGD and SCAFFOLD converge as fast as S-SGD, while Local SGD and EASGD converge slowly or even fail to converge. (a) LeNet, MNIST; (b) TextCNN, DBPedia; (c) Transfer Learning, Tiny ImageNet
Fig.4  Train loss in terms of communication size. VRL-SGD outperforms the other algorithms in terms of communication size. (a) LeNet, MNIST; (b) TextCNN, DBPedia; (c) Transfer Learning, Tiny ImageNet
Fig.5  Test accuracy in terms of communication size. VRL-SGD outperforms the other algorithms in terms of communication size. (a) LeNet, MNIST; (b) TextCNN, DBPedia; (c) Transfer Learning, Tiny ImageNet
Fig.6  Epoch loss for the identical case. All of the algorithms have a similar convergence rate. (a) LeNet, MNIST; (b) TextCNN, DBPedia; (c) Transfer Learning, Tiny ImageNet
Fig.7  Epoch loss for the non-identical case with different communication periods k. (a) LeNet, MNIST; (b) TextCNN, DBPedia; (c) Transfer Learning, Tiny ImageNet; (d) LeNet, MNIST; (e) TextCNN, DBPedia; (f) Transfer Learning, Tiny ImageNet; (g) LeNet, MNIST; (h) Transfer Learning, Tiny ImageNet; (i) TextCNN, DBPedia
Fig.8  Logarithm of the distance to the global optimum for different values of b and communication period k. (a) b=10, k=4; (b) b=10, k=16; (c) b=10, k=64; (d) b=100, k=4; (e) b=100, k=16; (f) b=100, k=64; (g) b=1000, k=4; (h) b=1000, k=16; (i) b=1000, k=64
Fig.A1  The iteration speedup of VRL-SGD as the number of workers N varies
1 H Robbins , S Monro . A stochastic approximation method. The Annals of Mathematical Statistics, 1951, 22( 3): 400– 407
2 K He, X Zhang, S Ren, J Sun. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770– 778
3 J Chen, D Liu, T Xu, S Wu, Y Cheng, E Chen. Is heuristic sampling necessary in training deep object detectors? 2019, arXiv preprint arXiv: 1909.04868
4 J Devlin, M W Chang, K Lee, K Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019
5 H T Cheng, L Koc, J Harmsen, T Shaked, T Chandra, H Aradhye, G Anderson, G Corrado, W Chai, M Ispir, R Anil, Z Haque, L Hong, V Jain, X Liu, H Shah. Wide & deep learning for recommender systems. In: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 2016, 7– 10
6 Q Liu , Z Huang , Y Yin , E Chen , H Xiong , Y Su , G Hu . EKT: exercise-aware knowledge tracing for student performance prediction. IEEE Transactions on Knowledge and Data Engineering, 2021, 33( 1): 100– 115
7 L Wu , Z Li , H Zhao , Z Pan , Q Liu , E Chen . Estimating early fundraising performance of innovations via graph-based market environment model. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34( 4): 6396– 6403
8 Y Liu , Q Liu , H Zhao , Z Pan , C Liu . Adaptive quantitative trading: an imitative deep reinforcement learning approach. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34( 2): 2128– 2135
9 J Wang , H Zhang , Q Liu , Z Pan , H Tao . Crowdfunding dynamics tracking: a reinforcement learning approach. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34( 4): 6210– 6218
10 J Wang, G Joshi. Cooperative SGD: a unified framework for the design and analysis of communication-efficient SGD algorithms. 2019, arXiv preprint arXiv: 1808.07576
11 F Zhou, G Cong. On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. 2018, 3219−3227
12 S U Stich. Local SGD converges fast and communicates little. In: Proceedings of 2019 International Conference on Learning Representations. 2019
13 H Yu , S Yang , S Zhu . Parallel restarted SGD with faster convergence and less communication: demystifying why model averaging works for deep learning. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33 : 5693– 5700
14 S Shen, L Xu, J Liu, X Liang, Y Cheng. Faster distributed deep net training: computation and communication decoupled stochastic gradient descent. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence. 2019, 4582−4589
15 H B McMahan, E Moore, D Ramage, S Hampson, B A Arcas. Communication-efficient learning of deep networks from decentralized data. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. 2017, 1273−1282
16 J Konečný, H B McMahan, F X Yu, A T Suresh, D Bacon, P Richtárik. Federated learning: strategies for improving communication efficiency. In: Proceedings of the ICLR 2018. 2018
17 T Li , A K Sahu , A Talwalkar , V Smith . Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine, 2020, 37( 3): 50– 60
18 P Kairouz, H B McMahan, B Avent, A Bellet, M Bennis, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 2021, 14(1−2): 1−210
19 J Yang, Y Duan, T Qiao, H Zhou, J Wang, W Zhao. Prototyping federated learning on edge computing systems. Frontiers of Computer Science, 2020, 14: 146318
20 S Ghadimi , G Lan . Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013, 23( 4): 2341– 2368
21 O Dekel , R Gilad-Bachrach , O Shamir , L Xiao . Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 2012, 13 : 165– 202
22 D Alistarh, D Grubic, J Li, R Tomioka, M Vojnovic. QSGD: communication-efficient SGD via gradient quantization and encoding. In: Proceedings of the Advances in Neural Information Processing Systems. 2017, 1709−1720
23 A F Aji, K Heafield. Sparse communication for distributed gradient descent. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017, 440– 445
24 J Bernstein, J Zhao, K Azizzadenesheli, A Anandkumar. SignSGD with majority vote is communication efficient and fault tolerant. In: Proceedings of the International Conference on Learning Representations. 2019
25 Lin Y, Han S, Mao H, Wang Y, Dally W J. Deep gradient compression: reducing the communication bandwidth for distributed training. In: Proceedings of the International Conference on Learning Representations. 2018
26 S P Karimireddy, Q Rebjock, S U Stich, M Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 3252−3261
27 H Tang, C Yu, X Lian, T Zhang, J Liu. DoubleSqueeze: parallel stochastic gradient descent with double-pass error-compensated compression. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 6155−6165
28 D Povey, X Zhang, S Khudanpur. Parallel training of DNNs with natural gradient and parameter averaging. 2015, arXiv preprint arXiv: 1410.7455
29 H Su, H Chen. Experiments on parallel training of deep neural network using model averaging. 2018, arXiv preprint arXiv: 1507.01239
30 T Lin, S U Stich, K K Patel, M Jaggi. Don’t use large mini-batches, use local SGD. In: Proceedings of the 8th International Conference on Learning Representations. 2020
31 H Yu, R Jin, S Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 7184−7193
32 F Haddadpour, M M Kamani, M Mahdavi, V Cadambe. Trading redundancy for communication: Speeding up distributed SGD for non-convex optimization. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 2545−2554
33 X Li, K Huang, W Yang, S Wang, Z Zhang. On the convergence of FedAvg on non-IID data. In: Proceedings of the 8th International Conference on Learning Representations. 2020
34 A Khaled, K Mishchenko, P Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics. 2020, 4519−4529
35 R Johnson, T Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems. 2013, 315– 323
36 L Zhang, M Mahdavi, R Jin. Linear convergence with condition number independent access of full gradients. In: Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013, 980– 988
37 A Defazio, F Bach, S Lacoste-Julien. SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Proceedings of the 27th International Conference on Neural Information Processing Systems. 2014, 1646−1654
38 L M Nguyen, J Liu, K Scheinberg, M Takáč. SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 2613−2621
39 W Shi , Q Ling , G Wu , W Yin . EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 2015, 25( 2): 944– 966
40 A Mokhtari , A Ribeiro . DSA: decentralized double stochastic averaging gradient algorithm. The Journal of Machine Learning Research, 2016, 17( 1): 2165– 2199
41 H Tang, X Lian, M Yan, C Zhang, J Liu. D2: decentralized training over decentralized data . In: Proceedings of the 35th International Conference on Machine Learning. 2018, 4855−4863
42 S P Karimireddy, S Kale, M Mohri, S Reddi, S Stich, A T Suresh. SCAFFOLD: stochastic controlled averaging for federated learning. In: Proceedings of the 37th International Conference on Machine Learning. 2020, 5132−5143
43 A Paszke, S Gross, S Chintala, G Chanan, E Yang, Z DeVito, Z Lin, A Desmaison, L Antiga, A Lerer. Automatic differentiation in PyTorch. In: Proceedings of the 31st Conference on Neural Information Processing Systems. 2017
44 S Zhang, A Choromanska, Y LeCun. Deep learning with elastic averaging SGD. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. 2015, 685– 693
45 A El-Sawy, H El-Bakry, M Loey. CNN for handwritten Arabic digits recognition based on LeNet-5. In: Proceedings of the International Conference on Advanced Intelligent Systems and Informatics. 2016, 566– 575
46 Y LeCun. The MNIST database of handwritten digits. See Yann.lecun.com/exdb/mnist/ website, 1998
47 Y Kim. Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014
48 J Lehmann, R Isele, M Jakob, A Jentzsch, D Kontokostas, P N Mendes, S Hellmann, M Morsey, P van Kleef, S Auer, C Bizer. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2015, 6(2): 167–195
49 J Deng, W Dong, R Socher, L J Li, K Li, L Fei-Fei. ImageNet: a large-scale hierarchical image database. In: Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009, 248– 255
50 J Pennington, R Socher, C D Manning. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014, 1532−1543
51 C Szegedy, V Vanhoucke, S Ioffe, J Shlens, Z Wojna. Rethinking the inception architecture for computer vision. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 2818−2826
52 S Ioffe, C Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning. 2015, 448– 456