Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2023, Vol. 17 Issue (2) : 172309    https://doi.org/10.1007/s11704-022-1154-1
RESEARCH ARTICLE
Weakly supervised temporal action localization with proxy metric modeling
Hongsheng XU1, Zihan CHEN2, Yu ZHANG2, Xin GENG2, Siya MI3,4(), Zhihong YANG1
1. NARI Group Corporation (State Grid Electric Power Research Institute), Nanjing 211106, China
2. School of Computer Science and Engineering, and the Key Lab of Computer Network and Information Integration (Ministry of Education), Southeast University, Nanjing 211189, China
3. School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
4. Purple Mountain Laboratories, Nanjing 211111, China
Abstract

Temporal localization is crucial for video action recognition. Because manual temporal annotations of videos are expensive and time-consuming, temporal localization with only weak video-level labels is challenging but indispensable. In this paper, we propose a weakly supervised temporal action localization approach for untrimmed videos. To address this problem, we train the model with a proxy for each action class. The proxies are used to measure the distances between action segments and the original features of different actions. With this proxy-based metric, we cluster segments of the same action together and separate actions from the background. Our method achieves results competitive with state-of-the-art methods on the THUMOS14 and ActivityNet1.2 datasets.
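The proxy-based metric described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's exact formulation: the cosine distance, the softmax-free nearest-proxy assignment, and the `bg_threshold` knob are all assumptions introduced here.

```python
import numpy as np

def proxy_distances(segments, proxies):
    """Cosine distances between T segment features (T, D) and C class proxies (C, D).

    Cosine distance is an illustrative choice; the paper's metric may differ.
    """
    s = segments / np.linalg.norm(segments, axis=1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    return 1.0 - s @ p.T  # (T, C); smaller means closer to that class proxy

def assign_segments(segments, proxies, bg_threshold=0.8):
    """Label each segment with its nearest proxy, or -1 (background) when no
    proxy is close enough. bg_threshold is a hypothetical knob."""
    d = proxy_distances(segments, proxies)
    nearest = d.argmin(axis=1)
    nearest[d.min(axis=1) > bg_threshold] = -1
    return nearest

rng = np.random.default_rng(0)
proxies = rng.normal(size=(20, 2048))                      # 20 THUMOS14 classes, 2048-D (RGB + flow)
segments = proxies[3] + 0.01 * rng.normal(size=(5, 2048))  # 5 segments near the class-3 proxy
print(assign_segments(segments, proxies))                  # each segment is assigned to class 3
```

The same distances could drive a proxy-based clustering loss (pull segments toward their class proxy, push them away from the others), which is the general idea behind proxy metric learning.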

Keywords: temporal action localization; weakly supervised videos; proxy metric
Corresponding Author(s): Siya MI   
Just Accepted Date: 09 September 2021   Issue Date: 04 August 2022
 Cite this article:   
Hongsheng XU, Zihan CHEN, Yu ZHANG, et al. Weakly supervised temporal action localization with proxy metric modeling[J]. Front. Comput. Sci., 2023, 17(2): 172309.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-1154-1
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I2/172309
Fig.1  Overview of our method. In the proxy calculation module, the action proxy map is trained based on the selection of pseudo-action segments and pseudo-background segments. The proxy metric learning module aims to cluster segments of the same action and to separate actions from the background
Supervision   Method             mAP@IoU                                 AVG
                                 0.3     0.4     0.5     0.6     0.7
Full          S-CNN [7]          36.3    28.7    19.0    10.3    5.3     19.9
Full          CDC [8]            40.1    29.4    23.3    13.1    7.9     22.8
Full          R-C3D [9]          44.8    35.6    28.9    –       –       –
Full          TAL-Net [10]       53.2    48.5    42.8    33.8    20.8    39.8
Weak          STPN (UNT) [13]    31.1    23.5    16.2    9.8     5.1     17.1
Weak          STPN (I3D) [13]    35.5    25.8    16.9    9.9     4.3     18.5
Weak          Liu et al. [16]    41.2    32.1    23.1    15.0    7.0     23.7
Weak          W-TALC [15]        40.1    31.1    22.8    –       7.6     25.4
Weak          RPN [19]           48.2    37.2    27.9    16.7    8.1     27.6
Weak          BMU [28]           46.9    39.2    30.7    20.8    12.5    30.0
Weak          Ours               46.8    39.1    30.9    21.0    12.6    30.1
Tab.1  Action localization performance on the THUMOS14 dataset. The last column AVG indicates the average mAP at IoU thresholds 0.3:0.1:0.7; "–" marks values not reported
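The mAP@IoU columns above score a predicted action segment as correct when its temporal intersection-over-union with a ground-truth segment exceeds the threshold, and AVG averages the mAP over thresholds 0.3:0.1:0.7. A minimal sketch of the temporal IoU computation (the interval representation is an assumption for illustration):

```python
def temporal_iou(a, b):
    """IoU of two temporal intervals a = (start, end) and b = (start, end), in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# The AVG column averages mAP over the IoU thresholds 0.3, 0.4, ..., 0.7:
thresholds = [round(0.3 + 0.1 * i, 1) for i in range(5)]

print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # overlap 2 s over union 6 s -> 0.333...
```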
Supervision   Method               mAP@IoU                 AVG
                                   0.5     0.7     0.9
Full          SSN [11]             41.3    30.4    13.2    28.3
Weak          UntrimmedNets [21]   7.4     3.9     1.2     4.2
Weak          AutoLoc [32]         27.3    17.5    6.8     17.2
Weak          AG [20]              29.4    17.5    7.5     18.1
Weak          W-TALC [15]          37.0    14.6    4.2     18.6
Weak          BSN [33]             38.5    27.1    11.9    25.8
Weak          Ours                 39.7    27.6    12.0    26.4
Tab.2  Action localization performance on the ActivityNet1.2 dataset
Tab.3  Ablation study of the different components (the losses Lcls, Lum, Lbe, Lack, Lbkg, and their fusion) on the THUMOS14 dataset; the average mAPs of the four reported configurations are 23.1, 28.5, 28.9, and 28.7
Fig.2  Proxy map of 20 action classes on THUMOS14. The feature vector of each proxy is 2048-dimensional, combining RGB and flow streams. Best viewed in color
Fig.3  Proxy map of 20 action classes on THUMOS14. The feature vector of each proxy is 2048-dimensional, combining RGB and flow streams. Best viewed in color
Fig.4  The effect of Kb on the average mAP at IoU thresholds 0.3:0.1:0.7. The horizontal axis is the number of pseudo-background segments used for background separation in each video
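Fig.4 varies Kb, the number of pseudo-background segments selected per video. A plausible selection rule, stated here only as an assumption since the paper's exact scoring criterion is not reproduced on this page, is to take the Kb segments with the lowest class-agnostic attention scores:

```python
import numpy as np

def select_pseudo_background(attention, k_b):
    """Indices of the k_b segments with the lowest attention scores.

    Treating low-attention segments as pseudo-background is a common
    weakly supervised heuristic; the paper's criterion may differ.
    """
    return np.argsort(attention)[:k_b]

att = np.array([0.9, 0.1, 0.8, 0.05, 0.7])  # hypothetical per-segment attention
print(sorted(select_pseudo_background(att, 2)))  # segments 1 and 3 have the lowest scores
```

Fig.4 then measures how sensitive the final localization quality is to this single hyperparameter.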
Fig.5  The qualitative results of VolleyballSpiking action on THUMOS14
Fig.6  The qualitative results of Shotput action on THUMOS14
Fig.7  The qualitative results of GolfSwing action on THUMOS14
1 Ronchetti F, Quiroga F, Lanzarini L, Estrebou C. Distribution of action movements (DAM): a descriptor for human action recognition. Frontiers of Computer Science, 2015, 9(6): 956–965
2 Chen K, Ding G, Han J. Attribute-based supervised deep learning model for action recognition. Frontiers of Computer Science, 2017, 11(2): 219–229
3 Wang J, Chen D, Yang J. Human behavior classification by analyzing periodic motions. Frontiers of Computer Science, 2010, 4(4): 580–588
4 Zhu X, Liu Z. Human behavior clustering for anomaly detection. Frontiers of Computer Science in China, 2011, 5(3): 279–289
5 Chebieb A, Ameur Y A. A formal model for plastic human computer interfaces. Frontiers of Computer Science, 2018, 12(2): 351–375
6 Chen W, Zhu S, Wan H, Feng J. Dual quaternion based virtual hand interaction modeling. Science China Information Sciences, 2013, 56(3): 1–11
7 Shou Z, Wang D, Chang S F. Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, 1049–1058
8 Shou Z, Chan J, Zareian A, Miyazawa K, Chang S F. CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 1417–1426
9 Xu H, Das A, Saenko K. R-C3D: region convolutional 3D network for temporal activity detection. In: Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). 2017, 5794–5803
10 Chao Y W, Vijayanarasimhan S, Seybold B, Ross D A, Deng J, Sukthankar R. Rethinking the faster R-CNN architecture for temporal action localization. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 1130–1139
11 Zhao Y, Xiong Y, Wang L, Wu Z, Tang X, Lin D. Temporal action detection with structured segment networks. In: Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). 2017, 2933–2942
12 Lin T, Liu X, Li X, Ding E, Wen S. BMN: boundary-matching network for temporal action proposal generation. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, 3888–3897
13 Nguyen P, Han B, Liu T, Prasad G. Weakly supervised action localization by sparse temporal pooling network. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 6752–6761
14 Islam A, Radke R J. Weakly supervised temporal action localization using deep metric learning. In: Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). 2020, 536–545
15 Paul S, Roy S, Roy-Chowdhury A K. W-TALC: weakly-supervised temporal activity localization and classification. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 588–607
16 Liu D, Jiang T, Wang Y. Completeness modeling and context separation for weakly supervised temporal action localization. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019, 1298–1307
17 Shi B, Dai Q, Mu Y, Wang J. Weakly-supervised action localization by generative attention modeling. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020, 1006–1016
18 Fernando B, Chet C T Y, Bilen H. Weakly supervised Gaussian networks for action detection. In: Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). 2020, 526–535
19 Huang L, Huang Y, Ouyang W, Wang L. Relational prototypical network for weakly supervised temporal action localization. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 11053–11060
20 Rashid M, Kjellström H, Lee Y J. Action graphs: weakly-supervised action localization with graph convolution networks. In: Proceedings of 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). 2020, 604–613
21 Wang L, Xiong Y, Lin D, Van Gool L. UntrimmedNets for weakly supervised action recognition and detection. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 6402–6411
22 Narayan S, Cholakkal H, Khan F S, Shao L. 3C-Net: category count and center loss for weakly-supervised action localization. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019, 8678–8686
23 Kim S, Kim D, Cho M, Kwak S. Proxy anchor loss for deep metric learning. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020, 3235–3244
24 Carreira J, Zisserman A. Quo Vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 4724–4733
25 Feichtenhofer C, Pinz A, Zisserman A. Convolutional two-stream network fusion for video action recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, 1933–1941
26 Bendale A, Boult T E. Towards open set deep networks. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, 1563–1572
27 Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty estimation using deep ensembles. In: Proceedings of Advances in Neural Information Processing Systems 30. 2017, 6402–6413
28 Lee P, Wang J, Lu Y, Byun H. Weakly-supervised temporal action localization by uncertainty modeling. 2020, arXiv preprint arXiv: 2006.07006
29 Movshovitz-Attias Y, Toshev A, Leung T K, Ioffe S, Singh S. No fuss distance metric learning using proxies. In: Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). 2017, 360–368
30 Idrees H, Zamir A R, Jiang Y G, Gorban A, Laptev I, Sukthankar R, Shah M. The THUMOS challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 2017, 155: 1–23
31 Heilbron F C, Escorcia V, Ghanem B, Niebles J C. ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, 961–970
32 Shou Z, Gao H, Zhang L, Miyazawa K, Chang S F. AutoLoc: weakly-supervised temporal action localization in untrimmed videos. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 162–179
33 Lee P, Uh Y, Byun H. Background suppression network for weakly-supervised temporal action localization. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020, 11320–11327
34 McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. 2018, arXiv preprint arXiv: 1802.03426