Front. Comput. Sci.    2023, Vol. 17 Issue (2) : 172313    https://doi.org/10.1007/s11704-022-1167-9
RESEARCH ARTICLE
Weakly supervised action anticipation without object annotations
Yi ZHONG1, Jia-Hui PAN1, Haoxin LI1, Wei-Shi ZHENG1,2
1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
2. Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Guangzhou 510006, China
Abstract

Anticipating future actions without observing any partial videos of the future plays an important role in action prediction and remains a challenging task. To obtain abundant information for action anticipation, some methods integrate multimodal contexts, including scene object labels. However, extensively labelling each frame in video datasets requires considerable effort. In this paper, we develop a weakly supervised method that integrates global motion and local fine-grained features from the current action video to predict the next action label without requiring specific scene context labels. Specifically, we extract diverse types of local features with weakly supervised learning, including object appearance and human pose representations, without ground-truth annotations. Moreover, we construct a graph convolutional network to exploit the inherent relationships between humans and objects in the observed action. We evaluate the proposed model on two datasets, the MPII-Cooking dataset and the EPIC-Kitchens dataset, and demonstrate the generalizability and effectiveness of our approach for action anticipation.
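To make the relation-modelling idea above concrete, the following PyTorch sketch (ours, not the authors' released code; the feature dimensions, layer widths and the single attention-style layer are assumptions) shows how the appearance features of K detected human/object proposals in a frame could be aggregated into relation-aware node features without any object annotations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalRelationLayer(nn.Module):
    """Aggregates K proposal features of one frame via learned pairwise attention."""
    def __init__(self, in_dim=2048, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)   # project raw appearance features
        self.att = nn.Linear(2 * out_dim, 1)     # score each ordered node pair

    def forward(self, x):
        # x: (K, in_dim) appearance features of K detected human/object proposals
        h = self.proj(x)
        k = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(k, k, -1),
                           h.unsqueeze(0).expand(k, k, -1)], dim=-1)  # (K, K, 2*out_dim)
        alpha = F.softmax(F.leaky_relu(self.att(pairs)).squeeze(-1), dim=-1)  # (K, K)
        return F.relu(alpha @ h)                  # (K, out_dim) relation-aware nodes

# Example: three proposals in one frame, fused into a single frame-level vector
nodes = ProposalRelationLayer()(torch.randn(3, 2048))
frame_vector = nodes.mean(dim=0)
```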

Keywords: action anticipation, weakly supervised learning, relation modelling, graph convolutional network
Corresponding Author(s): Wei-Shi ZHENG   
Just Accepted Date: 19 November 2021   Issue Date: 02 August 2022
 Cite this article:   
Yi ZHONG, Jia-Hui PAN, Haoxin LI, et al. Weakly supervised action anticipation without object annotations[J]. Front. Comput. Sci., 2023, 17(2): 172313.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-022-1167-9
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I2/172313
Fig.1  Weakly supervised learning setting without object annotations. The left model [1] follows a strongly supervised setting, directly integrating context labels such as the tools, ingredients and containers observed in the action. The right model, our weakly supervised setting, utilizes context cues without elaborate labels
Fig.2  Overview of our framework. The model consists of two types of branches with different capacities: the upper part is the global motion branch, while the local feature extractor comprises the appearance relation branch (upper) and the human skeleton branch (lower)
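As a rough guide to how the three branch outputs in Fig.2 could be combined for prediction, here is a minimal sketch; the branch dimensions, the placeholder number of action classes and the single linear classifier are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class AnticipationHead(nn.Module):
    """Fuses the three branch outputs and predicts the next-action label."""
    def __init__(self, motion_dim=1024, relation_dim=512, pose_dim=256, num_actions=100):
        super().__init__()
        # num_actions is a placeholder; the real value depends on the dataset label set.
        self.classifier = nn.Linear(motion_dim + relation_dim + pose_dim, num_actions)

    def forward(self, motion_feat, relation_feat, pose_feat):
        # Each input is assumed to be produced upstream: a 3D CNN (e.g., I3D) for global
        # motion, the appearance relation graph for proposals, and a pose encoder for skeletons.
        fused = torch.cat([motion_feat, relation_feat, pose_feat], dim=-1)
        return self.classifier(fused)            # logits over next-action classes

logits = AnticipationHead()(torch.randn(2, 1024), torch.randn(2, 512), torch.randn(2, 256))
```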
Fig.3  Sample frames of two datasets. The left frames are from the MPII-Cooking dataset, and the right frames are from the EPIC-Kitchens dataset
Method Top-1 accuracy/% Top-5 accuracy/%
Contexts [1] 33.1 –
CNFAS [2] 33.7 –
Contexts (I3D) 34.82 59.11
Ours 39.67 66.12
Tab.1  Results on the MPII-Cooking dataset ("–" indicates not reported)
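The Top-1/Top-5 numbers in Tab.1 (and the later tables) follow the standard top-k accuracy convention: a sample counts as correct if the ground-truth label appears among the k highest-scoring predictions. A generic sketch:

```python
import torch

def topk_accuracy(logits, labels, ks=(1, 5)):
    """Percentage of samples whose ground-truth label is among the top-k predictions."""
    # logits: (N, C) class scores, labels: (N,) ground-truth class indices
    _, pred = logits.topk(max(ks), dim=1)
    hits = pred.eq(labels.unsqueeze(1))          # (N, max_k) boolean hit matrix
    return {k: hits[:, :k].any(dim=1).float().mean().item() * 100 for k in ks}

# Example with random scores over 10 classes
acc = topk_accuracy(torch.randn(8, 10), torch.randint(0, 10, (8,)))
```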
Fig.4  Qualitative evaluation on the MPII-Cooking dataset. The top-5 predicted labels are shown on the right
Components Top-1 accuracy/% Top-5 accuracy/%
I3D 34.49 60.29
POSE 37.35 64.27
GAT 37.37 60.74
I3D-POSE 37.43 64.13
I3D-GAT 37.71 65.02
GAT-POSE 38.42 64.11
Ours (full model) 39.67 66.12
Tab.2  Ablation study on the MPII-Cooking dataset
Method | Top-1 accuracy/% (VERB / ACTION) | Top-5 accuracy/% (VERB / ACTION)
Contexts (I3D) | 27.41 / 05.26 | 73.01 / 15.22
I3D | 27.82 / 05.62 | 74.22 / 16.89
GAT | 29.29 / 07.46 | 74.88 / 20.67
Ours | 31.13 / 08.41 | 75.82 / 22.34
Tab.3  Results on the EPIC-Kitchens dataset
Test set | Method | Top-1 accuracy/% (VERB / ACTION) | Top-5 accuracy/% (VERB / ACTION)
S1 | Contexts (I3D) | 27.40 / 04.96 | 70.57 / 15.09
S1 | I3D | 28.42 / 05.59 | 72.23 / 16.20
S1 | GAT | 29.89 / 06.33 | 74.36 / 19.16
S1 | Ours | 31.27 / 08.13 | 74.86 / 21.54
S2 | Contexts (I3D) | 23.52 / 03.79 | 62.92 / 10.07
S2 | I3D | 24.96 / 04.03 | 66.00 / 11.98
S2 | GAT | 25.50 / 04.44 | 65.79 / 11.78
S2 | Ours | 27.04 / 05.50 | 67.12 / 14.27
Tab.4  Results on the EPIC-Kitchens test sets (S1: seen kitchens, S2: unseen kitchens)
Fig.5  Qualitative evaluation on the EPIC-Kitchens dataset. The top-5 predicted labels are shown on the right
P | MPII-Cooking Top-1/% | MPII-Cooking Top-5/% | EPIC-Kitchens Top-1/% (VERB / ACTION) | EPIC-Kitchens Top-5/% (VERB / ACTION)
1 | 37.41 | 64.73 | 28.36 / 06.78 | 74.61 / 18.63
3 | 39.67 | 66.12 | 31.13 / 08.41 | 75.82 / 22.34
5 | 38.05 | 65.06 | 30.49 / 07.84 | 75.34 / 21.31
7 | 37.46 | 63.60 | 30.55 / 08.18 | 75.99 / 22.13
9 | 37.95 | 63.85 | 30.56 / 08.04 | 75.76 / 22.16
Tab.5  Results of different frame selections on the two datasets. P: the number of frames selected for appearance relation graph learning
K | MPII-Cooking Top-1/% | MPII-Cooking Top-5/% | EPIC-Kitchens Top-1/% (VERB / ACTION) | EPIC-Kitchens Top-5/% (VERB / ACTION)
3 | 37.64 | 65.34 | 31.13 / 08.41 | 75.82 / 22.34
5 | 39.67 | 66.12 | 30.27 / 07.68 | 75.68 / 21.10
7 | 38.20 | 66.00 | 30.64 / 08.06 | 75.78 / 21.76
Obj | 38.55 | 65.42 | 29.46 / 06.77 | 74.88 / 18.79
Tab.6  Results of different proposal number selections on the two datasets. K: the number of proposals per frame in appearance relation graph learning. Obj: proposal appearance features are concatenated, replacing the relational graph in each frame of our framework
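To spell out the "Obj" baseline of Tab.6, the sketch below (dimensions assumed) contrasts it with the relational-graph variant: instead of relation-aware aggregation over the K proposals of a frame, their appearance features are simply concatenated and projected:

```python
import torch
import torch.nn as nn

K, D = 3, 2048                       # K proposals per frame, appearance feature size D
proposals = torch.randn(K, D)

# Full model: relation-aware aggregation over the K proposals
# (see the graph-attention-style sketch earlier in this page).
# "Obj" baseline: plain concatenation followed by a linear projection.
obj_encoder = nn.Linear(K * D, 512)  # maps the concatenated features to one frame vector
frame_feature_obj = obj_encoder(proposals.reshape(-1))
```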
M | MPII-Cooking Top-1/% | MPII-Cooking Top-5/% | EPIC-Kitchens Top-1/% (VERB / ACTION) | EPIC-Kitchens Top-5/% (VERB / ACTION)
R | 38.65 | 65.36 | 30.24 / 08.03 | 76.07 / 22.05
F | 39.67 | 66.12 | 31.13 / 08.41 | 75.82 / 22.34
R+F | 38.58 | 65.69 | 30.47 / 08.45 | 75.90 / 22.15
Tab.7  Results of different modality selections on the two datasets. R: RGB modality. F: optical flow modality
fg + ff | MPII-Cooking Top-1/% | MPII-Cooking Top-5/% | EPIC-Kitchens Top-1/% (VERB / ACTION) | EPIC-Kitchens Top-5/% (VERB / ACTION)
Pe+Me | 38.86 | 66.40 | 28.83 / 08.43 | 75.66 / 21.77
Pe+Ma | 39.02 | 66.48 | 28.95 / 07.55 | 74.90 / 20.58
Pe+Fc | 38.95 | 65.73 | 31.13 / 08.41 | 75.82 / 22.34
Fc+Me | 38.00 | 64.87 | 29.34 / 08.41 | 75.67 / 22.26
Fc+Ma | 38.08 | 64.52 | 29.19 / 08.10 | 75.51 / 21.77
Fc+Fc | 39.67 | 66.12 | 30.42 / 07.80 | 75.82 / 21.32
Tab.8  Results of different fusion methods on the two datasets. fg: graph feature fusion function f_graph_fus(·). ff: frame-level fusion function f_frame_fus(·). Pe: the node feature representing the person proposal bounding box. Me: mean pooling layer. Ma: max pooling layer. Fc: a fully connected layer. Pe+Me means that the node feature representing the person proposal bounding box is chosen as the output of f_graph_fus(·) and a mean pooling layer is applied as f_frame_fus(·)
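A minimal sketch of the fusion choices compared in Tab.8; tensor shapes and function names are assumptions used only to spell out what Pe, Me, Ma and Fc denote:

```python
import torch

def graph_fusion(nodes, mode, person_idx=0, fc=None):
    # nodes: (K, D) relation-aware proposal features of one frame
    if mode == "Pe":                 # keep the node of the person proposal
        return nodes[person_idx]
    if mode == "Fc":                 # fully connected layer over the flattened nodes
        return fc(nodes.reshape(-1))
    raise ValueError(mode)

def frame_fusion(frames, mode, fc=None):
    # frames: (P, D) fused vectors of the P selected frames
    if mode == "Me":                 # mean pooling over frames
        return frames.mean(dim=0)
    if mode == "Ma":                 # max pooling over frames
        return frames.max(dim=0).values
    if mode == "Fc":                 # fully connected layer over the flattened frames
        return fc(frames.reshape(-1))
    raise ValueError(mode)

# Example: the "Pe+Me" combination with P = 3 frames and K = 3 proposals per frame
K, P, D = 3, 3, 512
per_frame = torch.stack([graph_fusion(torch.randn(K, D), "Pe") for _ in range(P)])
clip_vector = frame_fusion(per_frame, "Me")     # final appearance-relation feature
```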
  
  
  
  
References
1 Mahmud T, Hasan M, Roy-Chowdhury A K. Joint prediction of activity labels and starting times in untrimmed videos. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 5784–5793
2 Mahmud T, Billah M, Hasan M, Roy-Chowdhury A K. Captioning near-future activity sequences. 2019, arXiv preprint arXiv:1908.00943
3 Rohrbach M, Amin S, Andriluka M, Schiele B. A database for fine grained activity detection of cooking activities. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2012, 1194–1201
4 Baradel F, Neverova N, Wolf C, Mille J, Mori G. Object level visual reasoning in videos. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 106–122
5 Ryoo M S. Human activity prediction: early recognition of ongoing activities from streaming videos. In: Proceedings of the IEEE International Conference on Computer Vision. 2011, 1036–1043
6 Xu Z, Qing L, Miao J. Activity auto-completion: predicting human activities from partial videos. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, 3191–3199
7 Kong Y, Fu Y. Max-margin action prediction machine. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 38(9): 1844–1858
8 Lan T, Chen T C, Savarese S. A hierarchical representation for future action prediction. In: Proceedings of the 13th European Conference on Computer Vision. 2014, 689–704
9 Hu J F, Zheng W S, Ma L, Wang G, Lai J. Real-time RGB-D activity prediction by soft regression. In: Proceedings of the 14th European Conference on Computer Vision. 2016, 280–296
10 Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, 4489–4497
11 Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 4724–4733
12 Kong Y, Tao Z, Fu Y. Deep sequential context networks for action prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 3662–3670
13 Qin J, Liu L, Shao L, Ni B, Chen C, Shen F, Wang Y. Binary coding for partial action analysis with limited observation ratios. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6728–6737
14 Lee D G, Lee S W. Prediction of partially observed human activity based on pre-trained deep representation. Pattern Recognition, 2019, 85: 198–206
15 Zolfaghari M, Singh K, Brox T. ECO: efficient convolutional network for online video understanding. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 713–730
16 Singh G, Saha S, Sapienza M, Torr P, Cuzzolin F. Online real-time multiple spatiotemporal action localisation and prediction. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 3657–3666
17 Lai S, Zheng W S, Hu J F, Zhang J. Global-local temporal saliency action prediction. IEEE Transactions on Image Processing, 2018, 27(5): 2272–2285
18 Kong Y, Gao S, Sun B, Fu Y. Action prediction from videos via memorizing hard-to-predict samples. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 2018, 7000–7007
19 Vondrick C, Pirsiavash H, Torralba A. Anticipating visual representations from unlabeled video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, 98–106
20 Zhong Y, Zheng W S. Unsupervised learning for forecasting action representations. In: Proceedings of the 25th IEEE International Conference on Image Processing. 2018, 1073–1077
21 Furnari A, Farinella G M. What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019, 6251–6260
22 Gao J, Yang Z, Nevatia R. RED: reinforced encoder-decoder networks for action anticipation. In: Proceedings of the British Machine Vision Conference. 2017
23 Zeng K H, Shen W B, Huang D A, Sun M, Niebles J C. Visual forecasting by imitating dynamics in natural sequences. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 3018–3027
24 Ng Y B, Fernando B. Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting. 2019, arXiv preprint arXiv:1912.04608
25 Pirri F, Mauro L, Alati E, Ntouskos V, Izadpanahkakhk M, Omrani E. Anticipation and next action forecasting in video: an end-to-end model with memory. 2019, arXiv preprint arXiv:1901.03728
26 Snell J, Swersky K, Zemel R. Prototypical networks for few-shot learning. In: Proceedings of the 31st Conference on Neural Information Processing Systems. 2017, 4080–4090
27 Farha Y A, Richard A, Gall J. When will you do what? - Anticipating temporal occurrences of activities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 5343–5352
28 Ke Q, Fritz M, Schiele B. Time-conditioned action anticipation in one shot. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 9917–9926
29 Wu T Y, Chien T A, Chan C S, Hu C W, Sun M. Anticipating daily intention using on-wrist motion triggered sensing. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 48–56
30 Sun C, Shrivastava A, Vondrick C, Sukthankar R, Murphy K, Schmid C. Relational action forecasting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 273–283
31 Zhang J, Elhoseiny M, Cohen S, Chang W, Elgammal A. Relationship proposal networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 5226–5234
32 Hu H, Gu J, Zhang Z, Dai J, Wei Y. Relation networks for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 3588–3597
33 Gkioxari G, Girshick R, Dollár P, He K. Detecting and recognizing human-object interactions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 8359–8367
34 Veličković P, Casanova A, Lio P, Cucurull G, Romero A, Bengio Y. Graph attention networks. In: Proceedings of the 6th International Conference on Learning Representations. 2018
35 Kato K, Li Y, Gupta A. Compositional learning for human object interaction. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 247–264
36 Wang X, Gupta A. Videos as space-time region graphs. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 413–431
37 Qi S, Wang W, Jia B, Shen J, Zhu S C. Learning human-object interactions by graph parsing neural networks. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 407–423
38 Zhang Q, Chang J, Meng G, Xu S, Xiang S, Pan C. Learning graph structure via graph convolutional networks. Pattern Recognition, 2019, 95: 308–318
39 Chao Y W, Yang J, Price B, Cohen S, Deng J. Forecasting human dynamics from static images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 3643–3651
40 Li C, Zhang Z, Lee W S, Lee G H. Convolutional sequence to sequence model for human dynamics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 5226–5234
41 Martinez J, Black M J, Romero J. On human motion prediction using recurrent neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 4674–4683
42 Bütepage J, Black M J, Kragic D, Kjellström H. Deep representation learning for human motion prediction and classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 1591–1599
43 Bloom V, Argyriou V, Makris D. Linear latent low dimensional space for online early action recognition and prediction. Pattern Recognition, 2017, 72: 532–547
44 Redmon J, Farhadi A. YOLOv3: an incremental improvement. 2018, arXiv preprint arXiv:1804.02767
45 Fang H S, Xie S, Tai Y W, Lu C. RMPE: regional multi-person pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision. 2017, 2353–2362
46 Xiu Y, Li J, Wang H, Fang Y, Lu C. Pose flow: efficient online pose tracking. In: Proceedings of the British Machine Vision Conference. 2018
47 Dalal N, Triggs B, Schmid C. Human detection using oriented histograms of flow and appearance. In: Proceedings of the 9th European Conference on Computer Vision. 2006, 428–441
Supplementary material: FCS-21167-OF-YZ_suppl_1