Weakly supervised temporal action localization with proxy metric modeling
Hongsheng XU1, Zihan CHEN2, Yu ZHANG2, Xin GENG2, Siya MI3,4, Zhihong YANG1
1. NARI Group Corporation (State Grid Electric Power Research Institute), Nanjing 211106, China
2. School of Computer Science and Engineering, and the Key Lab of Computer Network and Information Integration (Ministry of Education), Southeast University, Nanjing 211189, China
3. School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China
4. Purple Mountain Laboratories, Nanjing 211111, China

Abstract Temporal localization is crucial for action recognition in videos. Because manual annotation of videos is expensive and time-consuming, temporal localization from weak video-level labels is challenging but indispensable. In this paper, we propose a weakly supervised temporal action localization approach for untrimmed videos. To address this problem, we train the model using proxies for each action class. The proxies are used to measure the distances between action segments and the original features of the different actions. A proxy-based metric then clusters segments of the same action together and separates actions from background. Compared with state-of-the-art methods, our method achieves competitive results on the THUMOS14 and ActivityNet1.2 datasets.
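The abstract does not give the exact formulation, so the following is only a minimal sketch of the general idea behind a proxy-based metric, not the authors' method. It assumes one proxy vector per action class, cosine similarity as the metric, toy 2-D features, and a fixed background threshold; the names `cosine`, `label_segments`, `proxies`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_segments(segments, proxies, threshold=0.5):
    """Assign each segment to its most similar class proxy.

    Segments whose best similarity falls below `threshold` (an assumed,
    hand-picked value) are treated as background.
    """
    labels = []
    for feature in segments:
        sims = {name: cosine(feature, proxy) for name, proxy in proxies.items()}
        best = max(sims, key=sims.get)
        labels.append(best if sims[best] >= threshold else "background")
    return labels


# Toy data: two hypothetical class proxies and three segment features.
proxies = {"jump": [1.0, 0.0], "throw": [0.0, 1.0]}
segments = [[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0]]
labels = label_segments(segments, proxies)
print(labels)
```

In a trained model the proxies would be learned jointly with the segment features rather than fixed by hand; here they are constants purely to show how similarity to a proxy can group segments of the same action and push dissimilar segments toward background.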

Keywords: temporal action localization, weakly supervised videos, proxy metric

Corresponding Author(s): Siya MI

Just Accepted Date: 09 September 2021
Issue Date: 04 August 2022