Frontiers of Computer Science


Front. Comput. Sci.    2022, Vol. 16 Issue (3) : 163313    https://doi.org/10.1007/s11704-020-0298-0
RESEARCH ARTICLE
On the learning dynamics of two-layer quadratic neural networks for understanding deep learning
Zhenghao TAN1,2, Songcan CHEN1,2
1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
2. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing 211106, China
Abstract

Deep learning has proven a powerful paradigm in many real-world applications, yet its underlying mechanism remains largely a mystery. To gain insight into nonlinear hierarchical deep networks, we theoretically describe the coupled nonlinear learning dynamics of a two-layer neural network with quadratic activations, extending existing results for the linear case. Although rarely used in practice, the quadratic activation shares convexity with the widely used ReLU activation and therefore produces similar dynamics. We focus on a canonical regression problem under the standard normal distribution, model gradient descent by a coupled dynamical system in the continuous-time limit, and then use the high-order moment tensors of the normal distribution to simplify the resulting ordinary differential equations. The simplified system reveals unexpected fixed points; the existence of these non-globally-optimal stable points implies the existence of saddle points in the loss surface of quadratic networks. Our analysis also shows that certain quantities are conserved during training, which can cause learning to fail when the network is initialized improperly. Finally, we compare the numerical learning curves with the theoretical ones, revealing the two alternately appearing stages of the learning process.
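To make the setting concrete, the following is a minimal sketch of the kind of derivation the abstract describes; the notation (student weights W and a, teacher f*) is illustrative and not taken from the paper. A two-layer quadratic network is trained on a population regression loss under standard normal inputs, gradient descent is replaced by its continuous-time limit (gradient flow), and the fourth-order Gaussian moment identity (Isserlis' theorem) turns the expectations into polynomials in the weights:

\[
  f(x) = \sum_{j=1}^{L} a_j \,(w_j^\top x)^2, \qquad
  \mathcal{L}(W, a) = \tfrac{1}{2}\, \mathbb{E}_{x \sim \mathcal{N}(0, I_D)}\!\left[\bigl(f(x) - f^*(x)\bigr)^2\right],
\]
\[
  \dot{w}_j = -\nabla_{w_j} \mathcal{L}, \qquad
  \dot{a}_j = -\partial_{a_j} \mathcal{L}, \qquad
  \mathbb{E}\!\left[(u^\top x)^2 (v^\top x)^2\right] = \|u\|^2 \|v\|^2 + 2\,(u^\top v)^2 .
\]

Expanding the loss with the moment identity gives a closed system of coupled ordinary differential equations in the weights alone, which is the kind of object analyzed for fixed points, saddle points, and conserved quantities.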

Keywords: learning dynamic; quadratic network; ordinary differential equations
Corresponding Author(s): Songcan CHEN   
Just Accepted Date: 24 September 2020   Issue Date: 09 November 2021
 Cite this article:   
Zhenghao TAN, Songcan CHEN. On the learning dynamics of two-layer quadratic neural networks for understanding deep learning[J]. Front. Comput. Sci., 2022, 16(3): 163313.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-020-0298-0
https://academic.hep.com.cn/fcs/EN/Y2022/V16/I3/163313
Fig.1  A comparison between quadratic activation and ReLU activation. (a) Comparison of activations; (b) Comparison of loss curves
Fig.2  A typical vector field (black) with stable manifolds (red) of the dynamics in Eq. (9), and hyperbolas (blue) given by d(w₁² − 2w₂²) = 0
Fig.3  Learning curves of the quadratic neural network with L > D. (a) Network with D = 6; (b) Network with D = 10
Fig.4  Learning curve of the quadratic neural network with L ≤ D
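The learning-curve figures can be reproduced qualitatively with a short numerical experiment. The Python sketch below is an illustration under our own assumptions rather than the paper's code: it trains a width-L quadratic network by plain gradient descent on data from a planted quadratic teacher with standard normal inputs and records the empirical loss curve. Names such as loss_and_grads, width, and dim are invented for the example, and the teacher and hyperparameters are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

def loss_and_grads(X, y, W, a):
    # Two-layer quadratic network: f(x) = sum_j a_j * (w_j . x)^2.
    pre = X @ W.T                     # (n, L) pre-activations w_j . x
    pred = (pre ** 2) @ a             # (n,)  network outputs
    err = pred - y
    loss = 0.5 * np.mean(err ** 2)
    grad_a = (err @ (pre ** 2)) / len(y)                      # d loss / d a_j
    grad_W = 2.0 * a[:, None] * ((err * pre.T) @ X) / len(y)  # d loss / d w_j
    return loss, grad_W, grad_a

dim, width, n, lr, steps = 6, 10, 2000, 5e-3, 5000
X = rng.standard_normal((n, dim))

# Planted quadratic teacher (an arbitrary illustrative choice, not the paper's target).
W_star = rng.standard_normal((dim, dim)) / dim
y = np.sum((X @ W_star.T) ** 2, axis=1)

# Small random initialization; the paper notes that improper initialization can stall learning.
W = 0.1 * rng.standard_normal((width, dim))
a = 0.1 * rng.standard_normal(width)

curve = []
for _ in range(steps):
    loss, gW, ga = loss_and_grads(X, y, W, a)
    W -= lr * gW
    a -= lr * ga
    curve.append(loss)
# Plotting `curve` (e.g., with matplotlib) often shows plateau-then-drop stages
# reminiscent of the alternately appearing phases discussed with Fig. 3 and Fig. 4.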