Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2023, Vol. 17 Issue (2) : 172309    https://doi.org/10.1007/s11704-022-1154-1
RESEARCH ARTICLE
Weakly supervised temporal action localization with proxy metric modeling
Hongsheng XU1, Zihan CHEN2, Yu ZHANG2, Xin GENG2, Siya MI3,4(), Zhihong YANG1
1. NARI Group Corporation (State Grid Electric Power Research Institute), Nanjing 211106, China
2. School of Computer Science and Engineering, and the Key Lab of Computer Network and Information Integration (Ministry of Education), Southeast University, Nanjing 211189, China
3. School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
4. Purple Mountain Laboratories, Nanjing 211111, China
Abstract

Temporal localization is crucial for video action recognition. Because manual temporal annotations of videos are expensive and time-consuming, temporal localization with only weak video-level labels is challenging but indispensable. In this paper, we propose a weakly supervised temporal action localization approach for untrimmed videos. To address this problem, we train the model with a proxy for each action class. The proxies are used to measure the distances between action segments and the original features of different actions. With this proxy-based metric, we cluster segments of the same action together and separate actions from the background. Our method achieves results competitive with state-of-the-art methods on the THUMOS14 and ActivityNet1.2 datasets.
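The proxy-based metric described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the cosine distance, the softmax-free nearest-proxy assignment, and the `bg_threshold` knob are all assumptions introduced here.

```python
import numpy as np

def proxy_distances(segments, proxies):
    """Cosine distances between T segment features (T, D) and C class proxies (C, D).

    Cosine distance is an illustrative choice; the paper's metric may differ.
    """
    s = segments / np.linalg.norm(segments, axis=1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    return 1.0 - s @ p.T  # (T, C); smaller means closer to that class proxy

def assign_segments(segments, proxies, bg_threshold=0.8):
    """Label each segment with its nearest proxy, or -1 (background) when no
    proxy is close enough. bg_threshold is a hypothetical knob."""
    d = proxy_distances(segments, proxies)
    nearest = d.argmin(axis=1)
    nearest[d.min(axis=1) > bg_threshold] = -1
    return nearest

rng = np.random.default_rng(0)
proxies = rng.normal(size=(20, 2048))                      # 20 THUMOS14 classes, 2048-D (RGB + flow)
segments = proxies[3] + 0.01 * rng.normal(size=(5, 2048))  # 5 segments near the class-3 proxy
print(assign_segments(segments, proxies))                  # each segment is assigned to class 3
```

The same distances could drive a proxy-based clustering loss (pull segments toward their class proxy, push them away from the others), which is the general idea behind proxy metric learning.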

Keywords: temporal action localization; weakly supervised videos; proxy metric
Corresponding Author(s): Siya MI   
Just Accepted Date: 09 September 2021   Issue Date: 04 August 2022
 Cite this article:   
Hongsheng XU, Zihan CHEN, Yu ZHANG, et al. Weakly supervised temporal action localization with proxy metric modeling[J]. Front. Comput. Sci., 2023, 17(2): 172309.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-1154-1
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I2/172309
Fig.1  Overview of our method. In the proxy calculation module, the action proxy map is trained based on the selection of pseudo-action segments and pseudo-background segments. The proxy metric learning module aims to cluster segments of the same action and to separate actions from the background
Supervision   Method             mAP@IoU                                 AVG
                                 0.3     0.4     0.5     0.6     0.7
Full          S-CNN [7]          36.3    28.7    19.0    10.3    5.3     19.9
Full          CDC [8]            40.1    29.4    23.3    13.1    7.9     22.8
Full          R-C3D [9]          44.8    35.6    28.9    –       –       –
Full          TAL-Net [10]       53.2    48.5    42.8    33.8    20.8    39.8
Weak          STPN (UNT) [13]    31.1    23.5    16.2    9.8     5.1     17.1
Weak          STPN (I3D) [13]    35.5    25.8    16.9    9.9     4.3     18.5
Weak          Liu et al. [16]    41.2    32.1    23.1    15.0    7.0     23.7
Weak          W-TALC [15]        40.1    31.1    22.8    –       7.6     25.4
Weak          RPN [19]           48.2    37.2    27.9    16.7    8.1     27.6
Weak          BMU [28]           46.9    39.2    30.7    20.8    12.5    30.0
Weak          Ours               46.8    39.1    30.9    21.0    12.6    30.1
Tab.1  Action localization performance on the THUMOS14 dataset. The last column AVG indicates the average mAP at IoU thresholds 0.3:0.1:0.7; "–" marks values not reported
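The mAP@IoU columns above score a predicted action segment as correct when its temporal intersection-over-union with a ground-truth segment exceeds the threshold, and AVG averages the mAP over thresholds 0.3:0.1:0.7. A minimal sketch of the temporal IoU computation (the interval representation is an assumption for illustration):

```python
def temporal_iou(a, b):
    """IoU of two temporal intervals a = (start, end) and b = (start, end), in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# The AVG column averages mAP over the IoU thresholds 0.3, 0.4, ..., 0.7:
thresholds = [round(0.3 + 0.1 * i, 1) for i in range(5)]

print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # overlap 2 s over union 6 s -> 0.333...
```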
Supervision   Method               mAP@IoU                 AVG
                                   0.5     0.7     0.9
Full          SSN [11]             41.3    30.4    13.2    28.3
Weak          UntrimmedNets [21]   7.4     3.9     1.2     4.2
Weak          AutoLoc [32]         27.3    17.5    6.8     17.2
Weak          AG [20]              29.4    17.5    7.5     18.1
Weak          W-TALC [15]          37.0    14.6    4.2     18.6
Weak          BSN [33]             38.5    27.1    11.9    25.8
Weak          Ours                 39.7    27.6    12.0    26.4
Tab.2  Action localization performance on the ActivityNet1.2 dataset
Tab.3  Ablation study of the different components (the losses Lcls, Lum, Lbe, Lack, Lbkg, and their fusion) on the THUMOS14 dataset; the average mAPs of the four reported configurations are 23.1, 28.5, 28.9, and 28.7
Fig.2  Proxy map of 20 action classes on THUMOS14. The feature vector of each proxy is 2048-dimensional, combining RGB and flow streams. Best viewed in color
Fig.3  Proxy map of 20 action classes on THUMOS14. The feature vector of each proxy is 2048-dimensional, combining RGB and flow streams. Best viewed in color
Fig.4  The effect of Kb on the average mAP at IoU thresholds 0.3:0.1:0.7. The horizontal axis is the number of pseudo-background segments used for background separation in each video
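Fig.4 varies Kb, the number of pseudo-background segments selected per video. A plausible selection rule, stated here only as an assumption since the paper's exact scoring criterion is not reproduced on this page, is to take the Kb segments with the lowest class-agnostic attention scores:

```python
import numpy as np

def select_pseudo_background(attention, k_b):
    """Indices of the k_b segments with the lowest attention scores.

    Treating low-attention segments as pseudo-background is a common
    weakly supervised heuristic; the paper's criterion may differ.
    """
    return np.argsort(attention)[:k_b]

att = np.array([0.9, 0.1, 0.8, 0.05, 0.7])  # hypothetical per-segment attention
print(sorted(select_pseudo_background(att, 2)))  # segments 1 and 3 have the lowest scores
```

Fig.4 then measures how sensitive the final localization quality is to this single hyperparameter.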
Fig.5  The qualitative results of VolleyballSpiking action on THUMOS14
Fig.6  The qualitative results of Shotput action on THUMOS14
Fig.7  The qualitative results of GolfSwing action on THUMOS14
1 Ronchetti F, Quiroga F, Lanzarini L, Estrebou C. Distribution of action movements (DAM): a descriptor for human action recognition. Frontiers of Computer Science, 2015, 9(6): 956–965
2 Chen K, Ding G, Han J. Attribute-based supervised deep learning model for action recognition. Frontiers of Computer Science, 2017, 11(2): 219–229
3 Wang J, Chen D, Yang J. Human behavior classification by analyzing periodic motions. Frontiers of Computer Science, 2010, 4(4): 580–588
4 Zhu X, Liu Z. Human behavior clustering for anomaly detection. Frontiers of Computer Science in China, 2011, 5(3): 279–289
5 Chebieb A, Ameur Y A. A formal model for plastic human computer interfaces. Frontiers of Computer Science, 2018, 12(2): 351–375
6 Chen W, Zhu S, Wan H, Feng J. Dual quaternion based virtual hand interaction modeling. Science China Information Sciences, 2013, 56(3): 1–11
7 Shou Z, Wang D, Chang S F. Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, 1049–1058
8 Shou Z, Chan J, Zareian A, Miyazawa K, Chang S F. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 1417–1426
9 Xu H, Das A, Saenko K. R-C3D: region convolutional 3D network for temporal activity detection. In: Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). 2017, 5794–5803
10 Chao Y W, Vijayanarasimhan S, Seybold B, Ross D A, Deng J, Sukthankar R. Rethinking the faster R-CNN architecture for temporal action localization. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 1130–1139
11 Zhao Y, Xiong Y, Wang L, Wu Z, Tang X, Lin D. Temporal action detection with structured segment networks. In: Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). 2017, 2933–2942
12 Lin T, Liu X, Li X, Ding E, Wen S. BMN: boundary-matching network for temporal action proposal generation. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, 3888–3897
13 Nguyen P, Han B, Liu T, Prasad G. Weakly supervised action localization by sparse temporal pooling network. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 6752–6761
14 Islam A, Radke R J. Weakly supervised temporal action localization using deep metric learning. In: Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). 2020, 536–545
15 Paul S, Roy S, Roy-Chowdhury A K. W-TALC: weakly-supervised temporal activity localization and classification. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 588–607
16 Liu D, Jiang T, Wang Y. Completeness modeling and context separation for weakly supervised temporal action localization. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019, 1298–1307
17 Shi B, Dai Q, Mu Y, Wang J. Weakly-supervised action localization by generative attention modeling. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020, 1006–1016
18 Fernando B, Chet C T Y, Bilen H. Weakly supervised Gaussian networks for action detection. In: Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). 2020, 526–535
19 Huang L, Huang Y, Ouyang W, Wang L. Relational prototypical network for weakly supervised temporal action localization. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 11053–11060
20 Rashid M, Kjellström H, Lee Y J. Action graphs: weakly-supervised action localization with graph convolution networks. In: Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). 2020, 604–613
21 Wang L, Xiong Y, Lin D, Van Gool L. UntrimmedNets for weakly supervised action recognition and detection. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 6402–6411
22 Narayan S, Cholakkal H, Khan F S, Shao L. 3C-Net: category count and center loss for weakly-supervised action localization. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, 8678–8686
23 Kim S, Kim D, Cho M, Kwak S. Proxy anchor loss for deep metric learning. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020, 3235–3244
24 Carreira J, Zisserman A. Quo Vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 4724–4733
25 Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, 1933–1941
26 Bendale A, Boult T E. Towards open set deep networks. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, 1563–1572
27 Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty estimation using deep ensembles. In: Proceedings of Advances in Neural Information Processing Systems 30. 2017, 6402–6413
28 Lee P, Wang J, Lu Y, Byun H. Weakly-supervised temporal action localization by uncertainty modeling. 2020, arXiv preprint arXiv: 2006.07006
29 Movshovitz-Attias Y, Toshev A, Leung T K, Ioffe S, Singh S. No fuss distance metric learning using proxies. In: Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). 2017, 360–368
30 Idrees H, Zamir A R, Jiang Y G, Gorban A, Laptev I, Sukthankar R, Shah M. The THUMOS challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 2017, 155: 1–23
31 Heilbron F C, Escorcia V, Ghanem B, Niebles J C. ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, 961–970
32 Shou Z, Gao H, Zhang L, Miyazawa K, Chang S F. AutoLoc: weakly-supervised temporal action localization in untrimmed videos. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 162–179
33 Lee P, Uh Y, Byun H. Background suppression network for weakly-supervised temporal action localization. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 11320–11327
34 McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. 2018, arXiv preprint arXiv: 1802.03426