Frontiers of Computer Science  2025, Vol. 19 Issue (1): 191301   https://doi.org/10.1007/s11704-023-3245-z
New logarithmic step size for stochastic gradient descent
Mahsa Soheil SHAMAEE1, Sajad Fathi HAFSHEJANI2, Zeinab SAEIDIAN3
1. Department of Computer Science, Faculty of Mathematical Science, University of Kashan, Kashan 87317-53153, Iran
2. Department of Applied Mathematics, Shiraz University of Technology, Shiraz 13876-71557, Iran
3. Department of Mathematical Sciences, University of Kashan, Kashan 87317-53153, Iran
Full text: PDF (5696 KB)   HTML
Abstract

In this paper, we propose a novel warm restart technique that uses a new logarithmic step size for the stochastic gradient descent (SGD) approach. For smooth and non-convex functions, we establish an O(1/√T) convergence rate for SGD with this step size. We conduct comprehensive experiments to demonstrate the efficiency of the newly proposed step size on the FashionMNIST, CIFAR10, and CIFAR100 datasets. Moreover, we compare our results with nine other existing approaches and show that the new logarithmic step size improves test accuracy by 0.9% on the CIFAR100 dataset when a convolutional neural network (CNN) model is used.
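To make the idea concrete, here is a minimal sketch of SGD driven by a logarithmic step size with warm restarts. The exact formula and restart schedule are given in the full paper; the sketch assumes a schedule of the form η_t = η_0·(1 − ln t / ln T) within each stage, restarted every T iterations, and uses a toy least-squares problem purely for illustration.

```python
# Illustrative sketch only: a logarithmic step-size schedule of the ASSUMED
# form eta_t = eta0 * (1 - ln(t) / ln(T)), combined with warm restarts,
# driving SGD on a toy least-squares problem. The paper's exact formula,
# constants, and restart schedule may differ.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = A x* + noise
A = rng.normal(size=(1000, 20))
x_true = rng.normal(size=20)
y = A @ x_true + 0.01 * rng.normal(size=1000)


def log_step_size(t, T, eta0):
    """Assumed logarithmic schedule: eta0 at t = 1, decaying to 0 at t = T."""
    return eta0 * (1.0 - np.log(t) / np.log(T))


def sgd_with_warm_restarts(eta0=0.05, T=2000, n_restarts=3, batch=32):
    x = np.zeros(A.shape[1])
    for _ in range(n_restarts):          # each restart replays the schedule
        for t in range(1, T + 1):
            eta = log_step_size(t, T, eta0)
            idx = rng.integers(0, A.shape[0], size=batch)
            grad = A[idx].T @ (A[idx] @ x - y[idx]) / batch  # mini-batch gradient
            x -= eta * grad
    return x


x_hat = sgd_with_warm_restarts()
print("final mean squared error:", np.mean((A @ x_hat - y) ** 2))
```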

Key words: stochastic gradient descent; logarithmic step size; warm restart technique
Received: 2023-03-25      Published: 2024-03-14
Corresponding Author(s): Sajad Fathi HAFSHEJANI   
Cite this article:
Mahsa Soheil SHAMAEE, Sajad Fathi HAFSHEJANI, Zeinab SAEIDIAN. New logarithmic step size for stochastic gradient descent. Front. Comput. Sci., 2025, 19(1): 191301.
Link to this article:
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-023-3245-z
https://academic.hep.com.cn/fcs/CN/Y2025/V19/I1/191301
Step sizes | Conditions | Convergence rate | Reference
η_t = (η_0/2)(1 + cos(πt/T)) | η_0 = 1/(Lc), c = O(√T) | O(1/√T) | [18]
η_t = η_0 (β/T)^(t/T) | η_0 = 1/(Lc), c = O(√T) | O(1/√T) | [18]
η_t = η_0 (β/T)^(t/T) | β = O(T) | O(ln T/√T) | [20]
η_t = constant | η_0 = 1/√T | O(1/√T) | [15]
New proposed method | η_0 = 1/(Lc), c = O(√T/ln T) | O(1/√T) | New
Tab.1  Step-size schedules with their conditions and convergence rates for smooth non-convex objectives
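For intuition about how the schedules in Tab.1 differ, the short script below evaluates each decay profile over a single horizon of T iterations; the values of η_0 and β and the logarithmic form η_0(1 − ln t / ln T) are illustrative assumptions, not the constants used in the paper's analysis.

```python
# Sketch: compare the decay profiles of the schedules listed in Tab.1.
# eta0, beta, and the logarithmic form are illustrative assumptions,
# not the constants used in the paper.
import numpy as np

T = 1000
eta0 = 0.1
beta = np.sqrt(T)
t = np.arange(1, T + 1)

schedules = {
    "cosine":      eta0 / 2 * (1 + np.cos(np.pi * t / T)),
    "exponential": eta0 * (beta / T) ** (t / T),
    "constant":    np.full(T, 1 / np.sqrt(T)),
    "logarithmic": eta0 * (1 - np.log(t) / np.log(T)),  # assumed form
}

for name, eta in schedules.items():
    print(f"{name:12s} first={eta[0]:.4f}  middle={eta[T // 2]:.4f}  last={eta[-1]:.6f}")
```

With these illustrative formulas, the cosine curve collapses quadratically as t approaches T, whereas the assumed logarithmic curve decays roughly linearly in (T − t)/ln T, so the final iterations retain comparatively larger steps.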
Fig.1
Methods | FashionMNIST (η_0 / α / c) | CIFAR10 (η_0 / α / c) | CIFAR100 (η_0 / α / c)
Constant step size | 0.007 / – / – | 0.07 / – / – | 0.07 / – / –
O(1/t) step size | 0.05 / 0.00038 / – | 0.1 / 0.00023 / – | 0.8 / 0.004 / –
O(1/√t) step size | 0.05 / 0.00653 / – | 0.2 / 0.07907 / – | 0.1 / 0.015 / –
Adam | 0.0009 / – / – | 0.0009 / – / – | 0.0009 / – / –
SGD+Armijo | 0.5 / – / 0.1 | 2.5 / – / 0.1 | 5 / – / 0.5
ReduceLROnPlateau | 0.04 / 0.5 / – | 0.07 / 0.1 / – | 0.1 / 0.5 / –
Stagewise - 1 Milestone | 0.04 / 0.1 / – | 0.1 / 0.1 / – | 0.07 / 0.1 / –
Stagewise - 2 Milestones | 0.04 / 0.1 / – | 0.2 / 0.1 / – | 0.07 / 0.1 / –
Cosine step size | 0.05 / – / – | 0.25 / – / – | 0.09 / – / –
Ln step size | 0.05 / – / – | 0.25 / – / – | 0.53 / – / –
Tab.2  Hyper-parameter values (η_0, α, c) used for each method on the three datasets
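Most of the baselines in Tab.2 correspond to standard PyTorch optimizers and schedulers. The fragment below sketches one plausible instantiation for the CIFAR100 column; reading η_0 as the initial learning rate and α as the decay factor is our assumption, and the placeholder model, milestone positions, and horizon T are not the authors' exact setup.

```python
# Sketch: one possible PyTorch instantiation of a few baselines from Tab.2
# (CIFAR100 column). Mapping eta0 -> lr and alpha -> decay factor is an
# assumption; the model, milestones, and horizon T are placeholders.
import math

import torch
import torch.nn as nn

model = nn.Linear(3 * 32 * 32, 100)   # placeholder model

# Constant step size: plain SGD with lr = eta0
sgd_const = torch.optim.SGD(model.parameters(), lr=0.07)

# Adam with the listed eta0
adam = torch.optim.Adam(model.parameters(), lr=0.0009)

# ReduceLROnPlateau: shrink lr by the listed alpha when the tracked metric stalls
sgd_plateau = torch.optim.SGD(model.parameters(), lr=0.1)
plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(sgd_plateau, mode="min", factor=0.5)

# Stagewise (2 milestones): MultiStepLR with gamma = alpha; milestones are placeholders
sgd_stage = torch.optim.SGD(model.parameters(), lr=0.07)
stagewise = torch.optim.lr_scheduler.MultiStepLR(sgd_stage, milestones=[60, 120], gamma=0.1)

# Cosine step size over T epochs via LambdaLR
T = 200  # placeholder horizon
sgd_cos = torch.optim.SGD(model.parameters(), lr=0.09)
cosine = torch.optim.lr_scheduler.LambdaLR(
    sgd_cos, lr_lambda=lambda t: 0.5 * (1 + math.cos(math.pi * t / T))
)
```

SGD+Armijo and the proposed Ln step size have no off-the-shelf scheduler in torch.optim and would need custom code, for example a LambdaLR wrapping the logarithmic formula in the same way as the cosine one above.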
Fig.2  
Fig.3  
Fig.4  
Fig.5  
Methods | Training loss | Test accuracy
Constant step size | 0.0007 ± 0.0003 | 0.9299 ± 0.0016
O(1/t) step size | 0.0006 ± 0.0000 | 0.9311 ± 0.0012
O(1/√t) step size | 0.0011 ± 0.0003 | 0.9275 ± 0.0007
Adam | 0.0131 ± 0.0017 | 0.9166 ± 0.0019
SGD+Armijo | 6.73e−05 ± 0.0000 | 0.9277 ± 0.0012
ReduceLROnPlateau | 0.0229 ± 0.0032 | 0.9299 ± 0.0014
Stagewise - 1 Milestone | 0.0004 ± 0.0000 | 0.9303 ± 0.0007
Stagewise - 2 Milestones | 0.0015 ± 0.0000 | 0.9298 ± 0.0014
Cosine step size | 0.0004 ± 1.1e−05 | 0.9284 ± 0.0005
Ln step size | 0.0007 ± 0.0000 | 0.9314 ± 0.0018
Tab.3  
Methods | Training loss | Test accuracy
Constant step size | 0.1870 ± 0.0257 | 0.8776 ± 0.0060
O(1/t) step size | 0.0261 ± 0.0021 | 0.8946 ± 0.0020
O(1/√t) step size | 0.2634 ± 0.3493 | 0.8747 ± 0.0077
Adam | 0.1140 ± 0.0060 | 0.8839 ± 0.0031
SGD+Armijo | 0.0463 ± 0.0169 | 0.8834 ± 0.0059
ReduceLROnPlateau | 0.0646 ± 0.0041 | 0.9081 ± 0.0019
Stagewise - 1 Milestone | 0.0219 ± 0.0012 | 0.9111 ± 0.0034
Stagewise - 2 Milestones | 0.0344 ± 0.0062 | 0.9151 ± 0.0039
Cosine step size | 0.0102 ± 0.0007 | 0.9201 ± 0.0017
Ln step size | 0.0089 ± 0.0011 | 0.9012 ± 0.0027
Tab.4  
Methods | Training loss | Test accuracy
Constant step size | 1.19599 ± 0.0183 | 0.5579 ± 0.0039
O(1/t) step size | 1.43381 ± 0.0305 | 0.54436 ± 0.0038
O(1/√t) step size | 1.14579 ± 0.0125 | 0.5624 ± 0.0020
Adam | 1.53433 ± 0.0180 | 0.5206 ± 0.0053
SGD+Armijo | 1.3046 ± 0.0476 | 0.5244 ± 0.0053
ReduceLROnPlateau | 1.6000 ± 0.055 | 0.5148 ± 0.1929
Stagewise - 1 Milestone | 1.1889 ± 0.0077 | 0.5706 ± 0.0049
Stagewise - 2 Milestones | 1.09061 ± 0.0040 | 0.5779 ± 0.0017
Cosine step size | 1.12694 ± 0.0081 | 0.5739 ± 0.0025
Ln step size | 1.0805 ± 0.0327 | 0.5869 ± 0.0035
Tab.5  
Methods | Training loss | Test accuracy
Constant step size | 1.0789 ± 0.06336 | 0.6089 ± 0.01485
O(1/t) step size | 0.2709 ± 0.02047 | 0.6885 ± 0.0051
O(1/√t) step size | 0.7584 ± 0.04068 | 0.6411 ± 0.00469
Adam | 0.6008 ± 0.02396 | 0.6555 ± 0.00430
SGD+Armijo | 0.1162 ± 0.01175 | 0.6869 ± 0.00460
ReduceLROnPlateau | 0.0863 ± 0.00896 | 0.7440 ± 0.00695
Stagewise - 1 Milestone | 0.15300 ± 0.00382 | 0.7456 ± 0.00188
Stagewise - 2 Milestones | 0.21032 ± 0.01067 | 0.73102 ± 0.00402
Cosine step size | 0.0622 ± 0.00349 | 0.7568 ± 0.00311
Ln step size | 0.06011 ± 0.0018 | 0.7579 ± 0.0022
Tab.6  
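The entries in Tab.3–6 are of the form mean ± standard deviation, presumably aggregated over repeated runs with different random seeds. A minimal helper for producing such entries could look like the following; the per-seed numbers and the choice of sample standard deviation are placeholders and assumptions, not data from the paper.

```python
# Sketch: aggregating per-seed results into "mean ± std" entries as in Tab.3-6.
# The numbers below are made-up placeholders, and whether the paper uses the
# sample or population standard deviation is an assumption.
import numpy as np

test_acc_per_seed = np.array([0.757, 0.759, 0.756, 0.760, 0.758])  # placeholder values

mean = test_acc_per_seed.mean()
std = test_acc_per_seed.std(ddof=1)   # sample standard deviation over seeds
print(f"{mean:.4f} ± {std:.4f}")
```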
1 Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics, 1951, 22(3): 400–407
2 Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017, 60(6): 84–90
3 Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Toronto: University of Toronto, Department of Computer Science, 2009
4 Redmon J, Farhadi A. YOLO9000: better, faster, stronger. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6517–6525
5 Zhang J, Zong C. Deep neural networks in machine translation: an overview. IEEE Intelligent Systems, 2015, 30(5): 16–25
6 Mishra P, Sarawadekar K. Polynomial learning rate policy with warm restart for deep neural network. In: Proceedings of 2019 IEEE Region 10 Conference (TENCON). 2019, 2087–2092
7 Vaswani S, Mishkin A, Laradji I, Schmidt M, Gidel G, Lacoste-Julien S. Painless stochastic gradient: interpolation, line-search, and convergence rates. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 335
8 Gower R M, Loizou N, Qian X, Sailanbayev A, Shulgin E, Richtárik P. SGD: general analysis and improved rates. In: Proceedings of the 36th International Conference on Machine Learning. 2019, 5200–5209
9 Huang G, Liu Z, Van Der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2261–2269
10 Smith L N. Cyclical learning rates for training neural networks. In: Proceedings of 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). 2017, 464–472
11 Loshchilov I, Hutter F. SGDR: stochastic gradient descent with warm restarts. In: Proceedings of the 5th International Conference on Learning Representations. 2017
12 Vrbančič G, Podgorelec V. Efficient ensemble for image-based identification of pneumonia utilizing deep CNN and SGD with warm restarts. Expert Systems with Applications, 2022, 187: 115834
13 Xu G, Cao H, Dong Y, Yue C, Zou Y. Stochastic gradient descent with step cosine warm restarts for pathological lymph node image classification via PET/CT images. In: Proceedings of the 5th IEEE International Conference on Signal and Image Processing (ICSIP). 2020, 490–493
14 Nemirovski A, Juditsky A, Lan G, Shapiro A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009, 19(4): 1574–1609
15 Ghadimi S, Lan G H. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013, 23(4): 2341–2368
16 Bach F, Moulines E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Proceedings of the 24th International Conference on Neural Information Processing Systems. 2011, 451–459
17 Rakhlin A, Shamir O, Sridharan K. Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on Machine Learning. 2012
18 Li X, Zhuang Z, Orabona F. A second look at exponential and cosine step sizes: simplicity, adaptivity, and performance. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 6553–6564
19 Ge R, Kakade S M, Kidambi R, Netrapalli P. The step decay schedule: a near optimal, geometrically decaying learning rate procedure for least squares. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 1341
20 Wang X, Magnússon S, Johansson M. On the convergence of step decay step-size for stochastic optimization. In: Proceedings of the 35th Conference on Neural Information Processing Systems. 2021, 14226–14238
21 Nocedal J, Wright S J. Numerical Optimization. New York: Springer, 1999
22 He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770–778
23 Kingma D P, Ba J. Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations. 2015
24 Aybat N S, Fallah A, Gurbuzbalaban M, Ozdaglar A. A universally optimal multistage accelerated stochastic gradient method. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. 2019, 765
25 Thomas G B, Finney R L, Weir M D, Giordano F R. Thomas' Calculus, Early Transcendentals. 10th ed. Boston: Addison Wesley, 2002
Supplementary material: FCS-23245-OF-MSS_suppl_1