Frontiers of Computer Science

Front. Comput. Sci.    2022, Vol. 16 Issue (6) : 166334    https://doi.org/10.1007/s11704-021-0236-9
RESEARCH ARTICLE
ResLNet: deep residual LSTM network with longer input for action recognition
Tian WANG1, Jiakun LI2, Huai-Ning WU2, Ce LI3, Hichem SNOUSSI4, Yang WU5
1. Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
2. School of Automation Science and Electrical Engineering, Beihang University, Beijing 100191, China
3. College of Electrical and Information Engineering, Lanzhou University of Technology, Lanzhou 730050, China
4. Institute Charles Delaunay-LM2S FRE CNRS 2019, University of Technology of Troyes, Troyes 10010, France
5. Institute for Research Initiatives, Nara Institute of Science and Technology, Nara 630-0192, Japan
Abstract

Action recognition is an important research topic in video analysis that remains very challenging. Effective recognition relies on learning a good representation of both spatial information (for appearance) and temporal information (for motion). These two kinds of information are highly correlated but have quite different properties, so neither connecting independently trained models (e.g., CNN-LSTM) nor direct unbiased co-modeling (e.g., 3DCNN) yields satisfying results. Moreover, a long-standing convention for deep learning models on this task is to use only 8 or 16 consecutive frames as input, which makes it hard to extract discriminative motion features. In this work, we propose a novel network structure called ResLNet (deep residual LSTM network), which takes longer inputs (e.g., 64 frames) and lets convolutions collaborate with LSTM more effectively under the residual structure to learn better spatial-temporal representations; thanks to the proposed embedded variable stride convolution, this comes without the cost of extra computation. The superiority of this proposal and its ablation study are shown on the three most popular benchmark datasets: Kinetics, HMDB51, and UCF101. The proposed network can be fed with various features, such as RGB and optical flow. Due to the limited computation power of our experimental equipment and the real-time requirement, the proposed network is evaluated on RGB input only and still shows strong performance.

Keywords: action recognition, deep learning, neural network
Corresponding Author(s): Yang WU   
Just Accepted Date: 25 May 2021   Issue Date: 12 January 2022
 Cite this article:   
Tian WANG, Jiakun LI, Huai-Ning WU, et al. ResLNet: deep residual LSTM network with longer input for action recognition[J]. Front. Comput. Sci., 2022, 16(6): 166334.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-021-0236-9
https://academic.hep.com.cn/fcs/EN/Y2022/V16/I6/166334
Fig.1  These two clips are from the UCF101 dataset. High jump and long jump cannot be distinguished from the first 16 frames; both might even be recognized as running, which is a reasonable but incorrect result. In contrast, 64 consecutive frames contain enough information about the action
Fig.2  The proposed ResLNet, which is based on ResNet, has three novel components: variable stride convolution (V), residual LSTM block (L) and ConvLSTM with batch normalization (B)
Fig.3  Details of the residual block. BN refers to the batch normalization layer, which is applied after every convolution layer. Since the input and output tensors of a block may have different numbers of channels, the input of the block and the output of the convolutions cannot always be added directly by a shortcut; therefore a convolution layer with a 1×1×1 kernel is employed in the shortcut to adjust the number of channels. (a) Normal; (b) convolutional shortcut
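For concreteness, a minimal sketch of this residual block follows, assuming a PyTorch-style implementation; the class name, argument names, and exact placement of activations are illustrative and not taken from the authors' released code.

import torch
import torch.nn as nn

# Residual block of Fig. 3 (a sketch): two 3x3x3 convolutions, batch normalization
# after every convolution, and a 1x1x1 convolutional shortcut (Fig. 3b) when the
# number of channels (or the stride) changes.
class ResidualBlock3D(nn.Module):
    def __init__(self, in_channels, out_channels, stride=(1, 1, 1)):
        super().__init__()
        self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_channels)
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Identity shortcut (Fig. 3a) when shapes already match; otherwise a 1x1x1
        # convolution (followed by BN) matches channels and stride so the tensors can be added.
        self.shortcut = nn.Identity()
        if in_channels != out_channels or stride != (1, 1, 1):
            self.shortcut = nn.Sequential(
                nn.Conv3d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm3d(out_channels))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))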
Fig.4  Residual LSTM block. Compared with the residual block, the first convolution layer is replaced by a ConvLSTM structure and the second convolution layer performs spatial convolution only. 3×3 refers to the kernel size, and batch normalization is employed as in Eq. (3)
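The residual LSTM block can be sketched in the same assumed PyTorch style. The ConvLSTM cell below is a generic implementation (not the authors' code), and batch normalization is applied to the stacked ConvLSTM outputs as a simplification of the placement given by Eq. (3) in the paper; the hidden/output channel counts are free parameters, matching the separate L and 1×3×3 channel numbers in Tab. 1.

import torch
import torch.nn as nn

# Generic ConvLSTM cell: the four gates are computed with a single 2D convolution
# (kernel 3x3) over the concatenation of the input frame and the previous hidden state.
class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

# Residual LSTM block of Fig. 4 (a sketch): ConvLSTM replaces the first convolution,
# the second convolution is spatial only (1x3x3), and a 1x1x1 shortcut matches channels.
class ResidualLSTMBlock(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.cell = ConvLSTMCell(in_channels, hidden_channels)
        self.bn1 = nn.BatchNorm3d(hidden_channels)
        self.conv = nn.Conv3d(hidden_channels, out_channels, kernel_size=(1, 3, 3),
                              padding=(0, 1, 1), bias=False)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.shortcut = nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: (N, C, T, H, W)
        n, _, t, hgt, wdt = x.shape
        h = x.new_zeros(n, self.cell.hidden_channels, hgt, wdt)
        c = torch.zeros_like(h)
        steps = []
        for idx in range(t):                   # unroll the ConvLSTM over the time axis
            h, c = self.cell(x[:, :, idx], (h, c))
            steps.append(h)
        out = torch.stack(steps, dim=2)        # back to (N, hidden, T, H, W)
        out = self.bn2(self.conv(self.relu(self.bn1(out))))
        return self.relu(out + self.shortcut(x))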
Fig.5  Comparison between fixed strides and variable strides. In sub-figure (a), the operations corresponding to the dashed lines are unnecessary when a large stride is used. In sub-figure (b), the variable stride convolution is designed to avoid these unnecessary computations. The time and memory cost of variable strides is 60% of that of fixed strides
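A minimal sketch of the variable stride idea, again assuming PyTorch: the temporal convolution is evaluated only at a chosen (possibly irregular) list of output positions, so windows that a fixed large stride would discard are never computed. The helper name, the position schedule, and the omission of the spatial stride from Tab. 1 are illustrative choices, not taken from the paper.

import torch
import torch.nn.functional as F

def variable_stride_temporal_conv(x, weight, bias, positions, pad=(1, 1, 1)):
    """x: (N, C, T, H, W); weight: (C_out, C_in, kT, kH, kW).
    Evaluate the 3D convolution only at the given temporal output positions."""
    pt, ph, pw = pad
    k_t = weight.shape[2]
    x = F.pad(x, (pw, pw, ph, ph, pt, pt))          # pad W, H, T symmetrically
    outs = [F.conv3d(x[:, :, t:t + k_t], weight, bias) for t in positions]
    return torch.cat(outs, dim=2)                   # (N, C_out, len(positions), H, W)

# Illustrative use loosely mirroring Conv1 of Tab. 1 (64 input frames, 13 temporal
# outputs); the exact stride schedule of the paper is not reproduced here.
x = torch.randn(1, 3, 64, 112, 112)
w = torch.randn(32, 3, 3, 7, 7)
b = torch.zeros(32)
positions = [0, 6, 11, 16, 21, 26, 31, 36, 41, 46, 51, 56, 61]   # 13 positions
y = variable_stride_temporal_conv(x, w, b, positions, pad=(1, 3, 3))
print(y.shape)   # torch.Size([1, 32, 13, 112, 112]); spatial stride omitted in this sketch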
Layer | Output size | Filter | Stride (T, H, W)
Input | 64×112×112 | None | None
Conv1 | 13×56×56 | 3×7×7, 32; 3×3×3, 32 | variable, 2, 2
Conv2 | 13×56×56 | [3×3×3, 64; 3×3×3, 64; L, 3×3, 48; 1×3×3, 64] | 1, 1, 1
Conv3 | 13×28×28 | [3×3×3, 128; 3×3×3, 128; L, 3×3, 96; 1×3×3, 128] | 1, 2, 2
Conv4 | 13×14×14 | [3×3×3, 256; 3×3×3, 256; L, 3×3, 192; 1×3×3, 256] | 1, 2, 2
Conv5 | 13×7×7 | [3×3×3, 512; 3×3×3, 512; L, 3×3, 384; 1×3×3, 512] | 1, 2, 2
Average pooling | 1×1×1 | 13×7×7 | 1, 1, 1
Tab.1  Architecture of ResLNet. In each Conv stage, the stride of the first convolution layer of the first block is shown; all the others are "1, 1, 1". L refers to the ConvLSTM structure, whose kernel size is 3×3. At the end of the network, a global average pooling layer is employed
Model | Input size | Accuracy/%
CNN-LSTM (2017) [31] | 25×112×112 | 57.0
C3D (2017) [31] | 16×112×112 | 56.1
3D-ResNet-18 (2018) [4] | 16×112×112 | 54.2
ResLNet (L+B) | 16×112×112 | 58.1
ResLNet (V) | 64×112×112 | 60.2
ResLNet (V+L+B) | 64×112×112 | 62.4
Tab.2  Results on the validation set of Kinetics. For a fair comparison, all models take only RGB data as input
Model | HMDB51 | UCF101
CNN-LSTM (2017) [31] | 43.9 | 84.3
C3D (2017) [31] | 24.3 | 51.6
P3D (2017) [5] | ? | 88.6
3D-ResNet-18 (2018) [4] | 56.4 | 84.4
LTC (2018) [41] | ? | 82.4
DEM (2019) [42] | 52.6 | 83.7
ML-HDP (2019) [43] | ? | 89.3
ResLNet (V) | 59.7 | 87.3
ResLNet (V+L) | 55.5 | 84.0
ResLNet (L+B) | 62.4 | 89.7
ResLNet (V+L+B) | 63.1 | 90.5
Tab.3  Results on HMDB51 and UCF101. All accuracy values are averaged over three splits. Only results based on RGB data are listed
1 C Szegedy, W Liu, Y Jia, P Sermanet, S Reed, D Anguelov, D Erhan, V Vanhoucke, A Rabinovich. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, 1– 9
2 K He, X Zhang, S Ren, J Sun. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770– 778
3 G Huang, Z Liu, L Van Der Maaten, K Q Weinberger. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 4700−4708
4 K Hara, H Kataoka, Y Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, 6546−6555
5 Z Qiu, T Yao, T Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 5533−5541
6 D Tran, H Wang, L Torresani, J Ray, Y LeCun, M Paluri. A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, 6450−6459
7 K Hara, H Kataoka, Y Satoh. Learning spatio-temporal features with 3d residual networks for action recognition. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 3154−3160
8 D Tran, J Ray, Z Shou, S F Chang, M Paluri. Convnet architecture search for spatiotemporal feature learning. 2017, arXiv preprint arXiv: 1708.05038
9 M Ye, J Li, A J Ma, L Zheng, P C Yuen. Dynamic graph co-matching for unsupervised video-based person re-identification. IEEE Transactions on Image Processing, 2019, 28(6): 2976–2990
10 M Ye, X Lan, P C Yuen. Robust anchor embedding for unsupervised video person re-identification in the wild. In: Proceedings of the European Conference on Computer Vision. 2018, 170– 186
11 M Ye, J Shen, G Lin, T Xiang, L Shao, S C Hoi. Deep learning for person re-identification: a survey and outlook. 2020, arXiv preprint arXiv: 2001.04193
12 X Shi, Z Chen, H Wang, D Y Yeung, W K Wong, W C Woo. Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. 2015, 802–810
13 S Ioffe, C Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of International Conference on International Conference on Machine Learning. 2015
14 I Laptev. On space-time interest points. International Journal of Computer Vision, 2005, 64(2–3): 107–123
15 P Scovanner, S Ali, M Shah. A 3-dimensional sift descriptor and its application to action recognition. In: Proceedings of the 15th ACM International Conference on Multimedia. 2007, 357– 360
16 A Klaser, M Marszałek, C Schmid. A spatio-temporal descriptor based on 3d-gradients. In: Proceedings of the British Machine Vision Conference. 2008
17 T Wang, H Snoussi. Detection of abnormal visual events via global optical flow orientation histogram. IEEE Transactions on Information Forensics and Security, 2014, 9(6): 988–998
18 N Dalal, B Triggs, C Schmid. Human detection using oriented histograms of flow and appearance. In: Proceedings of the European Conference on Computer Vision. 2006, 428– 441
19 H Wang, C Schmid. Action recognition with improved trajectories. In: Proceedings of the IEEE International Conference on Computer Vision. 2013, 3551−3558
20 J Carreira, A Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2017, 4724−4733
21 K He, X Zhang, S Ren, J Sun. Identity mappings in deep residual networks. In: Proceedings of European Conference on Computer Vision. 2016, 630– 645
22 S Xie, R Girshick, P Dollár, Z Tu, K He. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 1492−1500
23 L Wang, W Li, W Li, L Van Gool. Appearance-and-relation networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, 1430−1439
24 K Simonyan, A Zisserman. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the International Conference on Learning Representations. 2015, 1– 14
25 C Kong, S Lucey. Take it in your stride: do we need striding in CNNs? 2017, arXiv preprint arXiv: 1712.02502
26 C Guo, Y L Liu, X Jiao. Study on the influence of variable stride scale change on image recognition in CNN. Multimedia Tools and Applications, 2019, 78(21): 30027–30037
27 L Zaniolo, O Marques. On the use of variable stride in convolutional neural networks. Multimedia Tools and Applications, 2020, 79(19): 13581–13598
28 K Simonyan, A Zisserman. Two-stream convolutional networks for action recognition in videos. In: Proceedings of the 28th International Conference on Neural Information Processing Systems. 2014, 568– 576
29 T Wang, M Qiao, A Zhu, G Shan, H Snoussi. Abnormal event detection via the analysis of multi-frame optical flow information. Frontiers of Computer Science, 2020, 14(2): 304–313
30 L Zhang, G Zhu, P Shen, J Song, S A Shah, M Bennamoun. Learning spatiotemporal features using 3DCNN and convolutional LSTM for gesture recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 3120−3128
31 W Kay, J Carreira, K Simonyan, B Zhang, C Hillier, S Vijayanarasimhan, F Viola, T Green, T Back, P Natsev, et al. The Kinetics human action video dataset. 2017, arXiv preprint arXiv: 1705.06950
32 J Yue-Hei Ng, M Hausknecht, S Vijayanarasimhan, O Vinyals, R Monga, G Toderici. Beyond short snippets: deep networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, 4694−4702
33 J Donahue, L Anne Hendricks, S Guadarrama, M Rohrbach, S Venugopalan, K Saenko, T Darrell. Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, 2625−2634
34 L Shen, R Hong, Y Hao. Advance on large scale near-duplicate video retrieval. Frontiers of Computer Science, 2020, 14(5): 145702
35 G Zhu, L Zhang, P Shen, J Song. Multimodal gesture recognition using 3-D convolution and convolutional LSTM. IEEE Access, 2017, 5: 4517–4524
36 C Laurent, G Pereyra, P Brakel, Y Zhang, Y Bengio. Batch normalized recurrent neural networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2016, 2657−2661
37 D Amodei, S Ananthanarayanan, R Anubhai, J Bai, E Battenberg, C Case, J Casper, B Catanzaro, Q Cheng, G Chen, et al. Deep speech 2: end-to-end speech recognition in English and Mandarin. In: Proceedings of International Conference on Machine Learning. 2016, 173–182
38 T Cooijmans, N Ballas, C Laurent, Ç Gülçehre, A Courville. Recurrent batch normalization. 2016, arXiv preprint arXiv: 1603.09025
39 H Kuehne, H Jhuang, E Garrote, T Poggio, T Serre. HMDB: a large video database for human motion recognition. In: Proceedings of the IEEE International Conference on Computer Vision. 2011, 2556−2563
40 K Soomro, A R Zamir, M Shah. UCF101: a dataset of 101 human actions classes from videos in the wild. 2012, arXiv preprint arXiv: 1212.0402
41 G Varol, I Laptev, C Schmid. Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(6): 1510–1517
42 J Zheng, X Cao, B Zhang, X Zhen, X Su. Deep ensemble machine for video classification. IEEE Transactions on Neural Networks and Learning Systems, 2019, 30(2): 553–565
43 N A Tu, T Huynh-The, K U Khan, Y K Lee. ML-HDP: a hierarchical Bayesian nonparametric model for recognizing human actions in video. IEEE Transactions on Circuits and Systems for Video Technology, 2019, 29(3): 800–814