Weakly supervised action anticipation without object annotations

Yi ZHONG1, Jia-Hui PAN1, Haoxin LI1, Wei-Shi ZHENG1,2

1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
2. Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Guangzhou 510006, China
Abstract: Anticipating future actions without observing any part of the future action video plays an important role in action prediction and remains a challenging task. To obtain rich information for action anticipation, some methods integrate multimodal contexts, including scene object labels; however, exhaustively labelling every frame of a video dataset requires considerable effort. In this paper, we develop a weakly supervised method that integrates global motion and local fine-grained features from the current action video to predict the next action label, without requiring scene-context labels. Specifically, we extract diverse types of local features by weakly supervised learning, including object appearance and human pose representations, without ground-truth annotations. Moreover, we construct a graph convolutional network to exploit the inherent relationships between humans and objects in the observed events. We evaluate the proposed model on two datasets, MPII-Cooking and EPIC-Kitchens, and demonstrate the generalizability and effectiveness of our approach for action anticipation.
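The core relational component described above (a graph convolutional network over human and object nodes) can be illustrated with a minimal sketch. This is not the authors' implementation: the node features, graph topology, dimensions, and the single-layer propagation rule below are illustrative assumptions, showing only how human–object relations can be aggregated through a normalized adjacency matrix.

```python
import numpy as np

def gcn_layer(node_feats, adj, weight):
    """One graph-convolution step: aggregate neighbour features through
    the symmetrically normalized adjacency (with self-loops), then apply
    a linear map followed by ReLU."""
    a = adj + np.eye(adj.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))      # D^{-1/2}
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ node_feats @ weight, 0.0)

rng = np.random.default_rng(0)
# Hypothetical scene graph: node 0 holds a human pose feature,
# nodes 1-3 hold appearance features of detected scene objects.
feats = rng.standard_normal((4, 16))
adj = np.array([[0, 1, 1, 1],                      # human linked to each object
                [1, 0, 0, 0],
                [1, 0, 0, 0],
                [1, 0, 0, 0]], dtype=float)
w = rng.standard_normal((16, 8))
out = gcn_layer(feats, adj, w)                     # relation-aware node features
print(out.shape)  # (4, 8)
```

After such propagation, each node's feature mixes information from its neighbours, so the human node carries object context and vice versa; the resulting features can then be pooled and fused with global motion features for anticipation.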
Keywords: action anticipation; weakly supervised learning; relation modelling; graph convolutional network
Corresponding author: Wei-Shi ZHENG
Just Accepted Date: 19 November 2021
Issue Date: 02 August 2022