Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2019, Vol. 13 Issue (2) : 302-317    https://doi.org/10.1007/s11704-018-8015-y
RESEARCH ARTICLE
Soft video parsing by label distribution learning
Miaogen LING, Xin GENG
Department of Computer Science and Engineering, Southeast University, Nanjing 211189, China
Abstract

In this paper, we tackle the problem of segmenting out a sequence of actions from videos. The videos contain background and actions, which are usually composed of ordered sub-actions. We refer to the sub-actions and the background as semantic units. Considering the possible overlap between two adjacent semantic units, we propose a bidirectional sliding window method to generate the label distributions for various segments in the video. The label distribution covers a certain number of semantic unit labels, representing the degree to which each label describes the video segment. The mapping from a video segment to its label distribution is then learned by a Label Distribution Learning (LDL) algorithm. Based on the LDL model, a soft video parsing method with segmental regular grammars is proposed to construct a tree structure for the video. Each leaf of the tree stands for a video clip of background or sub-action. The proposed method shows promising results on the THUMOS’14, MSR-II and UCF101 datasets, and its computational complexity is much lower than that of the compared state-of-the-art video parsing method.
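
As a rough illustration of what a label distribution for a video segment can look like, the sketch below scores each semantic unit by the fraction of the segment's frames it covers. This overlap-based rule, the annotation format, and the names `Unit` and `label_distribution` are assumptions made for illustration only; the sketch is not the bidirectional sliding window method proposed in the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation format: (label, start_frame, end_frame), end exclusive.
Unit = Tuple[str, int, int]


def label_distribution(segment: Tuple[int, int], units: List[Unit]) -> Dict[str, float]:
    """Describe a segment by the share of its annotated frames that fall in each unit.

    Assumption: the description degree of a label is proportional to its
    temporal overlap with the segment. The degrees are normalized to sum to 1.
    """
    seg_start, seg_end = segment
    overlaps: Dict[str, float] = {}
    for label, start, end in units:
        # Number of frames shared by the segment and this semantic unit.
        overlap = max(0, min(seg_end, end) - max(seg_start, start))
        if overlap > 0:
            overlaps[label] = overlaps.get(label, 0.0) + overlap
    total = sum(overlaps.values())
    if total == 0:
        return {}  # the segment touches no annotated unit
    return {label: value / total for label, value in overlaps.items()}


# A segment straddling the boundary between background and the first sub-action.
units = [("background", 0, 40), ("sub-action 1", 40, 70), ("sub-action 2", 70, 100)]
dist = label_distribution((30, 60), units)
```

Here the segment covers 10 background frames and 20 frames of the first sub-action, so the distribution assigns roughly one third to `background` and two thirds to `sub-action 1`, rather than forcing a single hard label onto a segment that spans a boundary.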

Keywords: video parsing, label distribution learning, subactions, graduality
Corresponding Author(s): Xin GENG   
Just Accepted Date: 14 June 2018   Online First Date: 04 September 2018    Issue Date: 08 April 2019
 Cite this article:   
Miaogen LING,Xin GENG. Soft video parsing by label distribution learning[J]. Front. Comput. Sci., 2019, 13(2): 302-317.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-018-8015-y
https://academic.hep.com.cn/fcs/EN/Y2019/V13/I2/302