Frontiers of Computer Science

Front. Comput. Sci.    2024, Vol. 18 Issue (5) : 185325    https://doi.org/10.1007/s11704-023-2418-0
RESEARCH ARTICLE
Contactless interaction recognition and interactor detection in multi-person scenes
Jiacheng LI1, Ruize HAN1, Wei FENG1, Haomin YAN1, Song WANG2
1. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
2. Department of Computer Science and Engineering, University of South Carolina, Columbia SC 29208, USA
Abstract

Human interaction recognition is an essential task in video surveillance. Existing work on human interaction recognition mainly focuses on scenarios that contain only the close-contact interactive subjects, without other people present. In this paper, we handle more practical but more challenging scenarios, in which the interactive subjects are contactless and other subjects not involved in the interactions of interest also appear in the scene. To address this problem, we propose an Interactive Relation Embedding Network (IRE-Net) that simultaneously identifies the subjects involved in the interaction and recognizes their interaction category. Since this is a new problem, we also build a new dataset with annotations and metrics for performance evaluation. Experimental results on this dataset show significant improvements of the proposed method over current methods developed for human interaction recognition and group activity recognition.
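To make the task setting concrete, below is a minimal, purely illustrative Python sketch (our own assumption, not code from the paper) of what a joint prediction for one video clip could look like: every detected subject in the multi-person scene carries a flag marking whether it is one of the contactless interactors, and the clip carries an interaction category. All names and values here are hypothetical.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class SubjectPrediction:
        subject_id: int                         # track ID of one person in the scene
        box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) bounding box
        is_interactor: bool                     # True if this subject takes part in the interaction

    @dataclass
    class ClipPrediction:
        subjects: List[SubjectPrediction]       # all people detected in the clip
        interaction_label: str                  # video-level interaction category

    # Hypothetical example: only two of the four detected people are the interactors.
    pred = ClipPrediction(
        subjects=[
            SubjectPrediction(0, (12, 40, 80, 220), True),
            SubjectPrediction(1, (300, 35, 370, 210), True),
            SubjectPrediction(2, (150, 50, 200, 190), False),
            SubjectPrediction(3, (420, 60, 470, 200), False),
        ],
        interaction_label="example_interaction",
    )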

Keywords: human-human interaction recognition; multiperson scene; contactless interaction; human relation modeling
Corresponding Author(s): Ruize HAN   
Just Accepted Date: 01 June 2023   Issue Date: 04 August 2023
 Cite this article:   
Jiacheng LI, Ruize HAN, Wei FENG, et al. Contactless interaction recognition and interactor detection in multi-person scenes[J]. Front. Comput. Sci., 2024, 18(5): 185325.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2418-0
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I5/185325
Fig.1  An illustration of different interactive activities. (a) “Pushing” in the UT-Interaction dataset and (b) “Shaking” in the AVA dataset. (c)–(d) Contactless interactive activities in the multi-person scene that are studied in this paper, where red bounding boxes indicate the interactive subjects
Fig.2  Illustration of the proposed method for contactless interactive subject identification and interaction recognition
Fig.3  Illustration of interaction prediction via IRE-Net. In the XOY plane, the three sub-tasks in this problem are modeled as a point, a line, and a face (with red color) of the proposed relation cube, respectively
Fig.4  Example cases of missed-subject filling in the proposed method. Here, a colored ball represents the feature vector of one subject at one frame. An empty block means that the feature vector of one subject at that time is missing. An arrow means filling one feature into another position. A short bold line means the historical/future average feature
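The missed-subject filling in Fig. 4 can be illustrated with a rough sketch (our own simplified reading of the caption, not the paper's exact procedure): a missing per-frame feature of a subject is replaced by the average of that subject's available historical features, falling back to the future ones at the beginning of a track.

    import numpy as np

    def fill_missing_features(feats):
        """feats: list of length T, one entry per frame; each entry is a 1-D feature
        vector (np.ndarray) or None if the subject was missed at that frame.
        Missing entries are filled with the historical average feature, or the
        future average when no history exists yet."""
        filled = list(feats)
        for t, f in enumerate(filled):
            if f is None:
                history = [x for x in feats[:t] if x is not None]     # frames before t
                future = [x for x in feats[t + 1:] if x is not None]  # frames after t
                source = history if history else future
                if source:  # stays None only if the subject never appears
                    filled[t] = np.mean(source, axis=0)
        return filled

    # Toy usage: a 4-frame track where the subject is missed at frame 2.
    track = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), None, np.array([5.0, 6.0])]
    print(fill_missing_features(track)[2])  # [2. 3.], the historical average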
Dataset | # Videos | # Frames | # Interactors
CO | 80 | 15,064 | 30,128
GO | 80 | 14,176 | 28,343
GR | 80 | 8,754 | 17,508
CH | 80 | 19,242 | 38,286
TA | 80 | 8,914 | 17,800
PP | 80 | 37,484 | 74,776
Training | 240 | 52,500 | 104,887
Testing | 240 | 51,134 | 101,954
Full | 480 | 103,634 | 206,841
Tab.1  Statistics of the proposed dataset
Dataset | # Types | Ratio | # Subjects
UT-Interaction [6] | 1 | 0.17 | ~2
ShakeFive2 [15] | 0 | 0 | 2
SBU Kinect [7] | 2 | 0.25 | 2
AVA [8] | 2 | 0.25 | 1.8
Ours | 6 | 1 | 9.6
Tab.2  Statistics and comparison of the proposed dataset and others
Method | Interactor Ind. (P / R / F) | Vid. Interaction Rec. (P / R / F) | Sub. Interaction Rec. (P / R / F) | Overall (MHIA)
Chance | 15.1 / 6.3 / 8.8 | 15.6 / 15.8 / 15.6 | 2.3 / 13.6 / 4.0 | 4.9
X3D [62] | – / – / – | 60.4 / 22.4 / 32.5 | – / – / – | –
SlowFast [63] | – / – / – | 59.3 / 21.1 / 30.9 | – / – / – | –
SlowFast w/ box [63] | 12.5 / 13.2 / 12.2 | 45.1 / 44.2 / 41.6 | 12.4 / 16.3 / 14.1 | 12.0
ARG [45] | 14.6 / 75.3 / 24.3 | 58.7 / 58.8 / 58.2 | 8.8 / 15.2 / 11.2 | 18.2
GR2N [48] | 53.9 / 53.3 / 53.6 | 51.3 / 52.9 / 50.3 | 17.9 / 55.3 / 27.1 | 40.0
GPNN [50] | 20.0 / 20.6 / 20.3 | 55.1 / 52.5 / 48.2 | 9.1 / 57.5 / 15.7 | 15.7
HiGCIN [64] | 50.4 / 51.9 / 51.1 | 63.1 / 61.2 / 60.8 | 21.9 / 54.6 / 31.2 | 41.4
Dynamic [65] | 40.3 / 41.5 / 40.9 | 55.5 / 55.8 / 53.1 | 19.6 / 27.4 / 22.8 | 30.4
Ours | 65.2 / 48.1 / 55.3 | 63.8 / 65.0 / 64.2 | 40.0 / 42.0 / 41.0 | 44.2
Tab.3  Comparative results of interactor identification, video interaction recognition and subject interaction recognition (%)
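For readers reproducing these numbers: P, R and F denote precision, recall and F-score, and MHIA is the overall metric defined in the paper. As a hedged reference only (an assumption on our part, since the exact averaging protocol and the MHIA definition are given in the full text), the sketch below shows one common way per-class precision/recall/F1 are computed and then macro-averaged over the interaction classes.

    def macro_prf(y_true, y_pred, classes):
        """Per-class precision, recall and F1, macro-averaged over the given classes."""
        ps, rs, fs = [], [], []
        for c in classes:
            tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
            fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
            fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
            prec = tp / (tp + fp) if tp + fp else 0.0
            rec = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
            ps.append(prec); rs.append(rec); fs.append(f1)
        n = len(classes)
        return sum(ps) / n, sum(rs) / n, sum(fs) / n

    # Toy usage with two classes:
    print(macro_prf(["a", "a", "b"], ["a", "b", "b"], ["a", "b"]))  # (0.75, 0.75, 0.666...)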
Method | Interactor Ind. (P / R / F) | Vid. Interaction Rec. (P / R / F) | Sub. Interaction Rec. (P / R / F) | Overall (MHIA)
w/o Spatial | 64.1 / 46.3 / 53.7 | 60.4 / 62.5 / 60.1 | 35.4 / 45.6 / 39.8 | 42.8
w/o Appearance | 41.5 / 41.9 / 41.7 | 58.1 / 57.9 / 57.6 | 15.7 / 58.6 / 24.7 | 38.2
w/o Long-term Agr. | 54.9 / 43.0 / 48.2 | 63.2 / 63.3 / 62.8 | 24.7 / 52.0 / 33.5 | 39.2
w/o Short-term Sap. | 62.3 / 47.5 / 53.4 | 64.7 / 64.6 / 63.7 | 38.6 / 41.2 / 39.8 | 43.7
w/o Cube | 21.1 / 21.7 / 21.4 | 65.0 / 65.4 / 65.0 | 38.0 / 39.8 / 38.9 | 16.2
w/o Filling | 66.2 / 46.2 / 54.4 | 64.1 / 64.6 / 63.5 | 26.7 / 54.3 / 35.8 | 42.7
w/o Triplet | 66.1 / 50.4 / 57.2 | 60.2 / 61.3 / 60.3 | 49.8 / 28.6 / 36.3 | 43.9
w/ R weight | 61.6 / 45.2 / 52.1 | 57.7 / 58.3 / 55.9 | 36.2 / 42.4 / 39.0 | 43.3
Ours | 65.2 / 48.1 / 55.3 | 63.8 / 65.0 / 64.2 | 40.0 / 42.0 / 41.0 | 44.2
Tab.4  Ablation study results of the proposed method for the three tasks (%)
1 Zhao J, Han R, Gan Y, Wan L, Feng W, Wang S. Human identification and interaction detection in cross-view multi-person videos with wearable cameras. In: Proceedings of the 28th ACM International Conference on Multimedia. 2020
2 Li G, Qu W, Huang Q. A multiple targets appearance tracker based on object interaction models. IEEE Transactions on Circuits and Systems for Video Technology, 2012, 22(3): 450–464
3 Liang J, Jiang L, Niebles J C, Hauptmann A G, Li F F. Peeking into the future: predicting future person activities and locations in videos. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019
4 Mehran R, Oyama A, Shah M. Abnormal crowd behavior detection using social force model. In: Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009
5 Han R, Zhao J, Feng W, Gan Y, Wan L, Wang S. Complementary-view co-interest person detection. In: Proceedings of the 28th ACM International Conference on Multimedia. 2020
6 Ryoo M S, Aggarwal J K. Interaction dataset, ICPR 2010 contest on semantic description of human activities (SDHA 2010). 2010
7 Yun K, Honorio J, Chattopadhyay D, Berg T L, Samaras D. Two-person interaction detection using body-pose features and multiple instance learning. In: Proceedings of 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. 2012
8 Gu C, Sun C, Ross D A, Vondrick C, Pantofaru C, Li Y, Vijayanarasimhan S, Toderici G, Ricco S, Sukthankar R, Schmid C, Malik J. AVA: a video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018
9 Han R, Feng W, Zhang Y, Zhao J, Wang S. Multiple human association and tracking from egocentric and complementary top views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(9): 5225–5242
10 Han R, Zhang Y, Feng W, Gong C, Zhang X, Zhao J, Wan L, Wang S. Multiple human association between top and horizontal views by matching subjects’ spatial distributions. 2019, arXiv preprint arXiv: 1907.11458
11 Han R, Feng W, Zhao J, Niu Z, Zhang Y, Wan L, Wang S. Complementary-view multiple human tracking. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020
12 Carreira J, Noland E, Hillier C, Zisserman A. A short note on the Kinetics-700 human action dataset. 2019, arXiv preprint arXiv: 1907.06987
13 Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, Suleyman M, Zisserman A. The Kinetics human action video dataset. 2017, arXiv preprint arXiv: 1705.06950
14 Kong Y, Jia Y, Fu Y. Learning human interaction by interactive phrases. In: Proceedings of the 12th European Conference on Computer Vision. 2012
15 Van Gemeren C, Poppe R, Veltkamp R C. Spatio-temporal detection of fine-grained dyadic human interactions. In: Proceedings of the 7th International Workshop on Human Behavior Understanding. 2016
16 Taylor G W, Fergus R, LeCun Y, Bregler C. Convolutional learning of spatio-temporal features. In: Proceedings of the 11th European Conference on Computer Vision. 2010
17 Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). 2015
18 Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017
19 Zhang C, Zou Y, Chen G, Gan L. PAN: persistent appearance network with an efficient motion cue for fast action recognition. In: Proceedings of the 27th ACM International Conference on Multimedia. 2019
20 Wang Z, Liu S, Zhang J, Chen S, Guan Q. A spatio-temporal CRF for human interaction understanding. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 27(8): 1647–1660
21 Motiian S, Siyahjani F, Almohsen R, Doretto G. Online human interaction detection and recognition with multiple cameras. IEEE Transactions on Circuits and Systems for Video Technology, 2017, 27(3): 649–663
22 Song S, Lan C, Xing J, Zeng W, Liu J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017
23 Gao X, Hu W, Tang J, Liu J, Guo Z. Optimized skeleton-based action recognition via sparsified graph regression. In: Proceedings of the 27th ACM International Conference on Multimedia. 2019
24 Tang Y, Tian Y, Lu J, Li P, Zhou J. Deep progressive reinforcement learning for skeleton-based action recognition. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018
25 Wang Z, Ge J, Guo D, Zhang J, Lei Y, Chen S. Human interaction understanding with joint graph decomposition and node labeling. IEEE Transactions on Image Processing, 2021, 30: 6240–6254
26 Feichtenhofer C, Pinz A, Wildes R P. Spatiotemporal residual networks for video action recognition. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. 2016
27 Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M. A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018
28 Qiu Z, Yao T, Mei T. Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). 2017
29 Wang H, Schmid C. Action recognition with improved trajectories. In: Proceedings of 2013 IEEE International Conference on Computer Vision. 2013
30 Wang L, Qiao Y, Tang X. Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015
31 Lee D G, Lee S W. Human interaction recognition framework based on interacting body part attention. Pattern Recognition, 2022, 128: 108645
32 Tu H, Xu R, Chi R, Peng Y. Multiperson interactive activity recognition based on interaction relation model. Journal of Mathematics, 2021, 2021: 5576369
33 Verma A, Meenpal T, Acharya B. Multiperson interaction recognition in images: a body keypoint based feature image analysis. Computational Intelligence, 2021, 37(1): 461–483
34 Patron-Perez A, Marszalek M, Reid I, Zisserman A. Structured learning of human interactions in TV shows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(12): 2441–2453
35 Zhao H, Torralba A, Torresani L, Yan Z. HACS: human action clips and segments dataset for recognition and temporal localization. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019
36 Joo H, Liu H, Tan L, Gui L, Nabbe B, Matthews I, Kanade T, Nobuhara S, Sheikh Y. Panoptic studio: a massively multiview system for social motion capture. In: Proceedings of 2015 IEEE International Conference on Computer Vision (ICCV). 2015
37 Ehsanpour M, Saleh F, Savarese S, Reid I, Rezatofighi H. JRDB-Act: a large-scale dataset for spatio-temporal action, social group and activity detection. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022
38 Li J, Han R, Yan H, Qian Z, Feng W, Wang S. Self-supervised social relation representation for human group detection. In: Proceedings of the 17th European Conference on Computer Vision. 2022
39 Han R, Yan H, Li J, Wang S, Feng W, Wang S. Panoramic human activity recognition. In: Proceedings of the 17th European Conference on Computer Vision. 2022
40 Shu T, Todorovic S, Zhu S C. CERN: confidence-energy recurrent network for group activity recognition. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017
41 Shu X, Tang J, Qi G, Liu W, Yang J. Hierarchical long short-term concurrent memory for human interaction recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(3): 1110–1118
42 Zhang P, Tang Y, Hu J F, Zheng W S. Fast collective activity recognition under weak supervision. IEEE Transactions on Image Processing, 2020, 29: 29–43
43 Yuan H, Ni D. Learning visual context for group activity recognition. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021
44 Yan R, Tang J, Shu X, Li Z, Tian Q. Participation-contributed temporal dynamic model for group activity recognition. In: Proceedings of the 26th ACM International Conference on Multimedia. 2018
45 Wu J, Wang L, Wang L, Guo J, Wu G. Learning actor relation graphs for group activity recognition. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019
46 Choi W, Shahid K, Savarese S. What are they doing?: Collective activity classification using spatio-temporal relationship among people. In: Proceedings of the 12th IEEE International Conference on Computer Vision Workshops, ICCV Workshops. 2009
47 Ibrahim M S, Muralidharan S, Deng Z, Vahdat A, Mori G. A hierarchical deep temporal model for group activity recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016
48 Li W, Duan Y, Lu J, Feng J, Zhou J. Graph-based social relation reasoning. In: Proceedings of the 16th European Conference on Computer Vision. 2020
49 Li J, Wong Y, Zhao Q, Kankanhalli M S. Visual social relationship recognition. International Journal of Computer Vision, 2020, 128(6): 1750–1764
50 Qi S, Wang W, Jia B, Shen J, Zhu S C. Learning human-object interactions by graph parsing neural networks. In: Proceedings of the 15th European Conference on Computer Vision. 2018
51 Zhong X, Ding C, Qu X, Tao D. Polysemy deciphering network for robust human–object interaction detection. International Journal of Computer Vision, 2021, 129(6): 1910–1929
52 Qiao T, Men Q, Li F W, Kubotani Y, Morishima S, Shum H P H. Geometric features informed multi-person human-object interaction recognition in videos. In: Proceedings of the 17th European Conference on Computer Vision. 2022
53 Bai L, Chen F, Tian Y. Automatically detecting human-object interaction by an instance part-level attention deep framework. Pattern Recognition, 2023, 134: 109110
54 Li F, Wang S, Wang S, Zhang L. Human-object interaction detection: a survey of deep learning-based methods. In: Proceedings of the 2nd CAAI International Conference on Artificial Intelligence. 2022
55 Antoun M, Asmar D. Human object interaction detection: design and survey. Image and Vision Computing, 2023, 130: 104617
56 Lim J, Baskaran V M, Lim J M Y, Wong K, See J, Tistarelli M. ERNet: an efficient and reliable human-object interaction detection network. IEEE Transactions on Image Processing, 2023, 32: 964–979
57 Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics. 2010
58 Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016
59 He K M, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proceedings of 2017 IEEE International Conference on Computer Vision (ICCV). 2017
60 Schroff F, Kalenichenko D, Philbin J. FaceNet: a unified embedding for face recognition and clustering. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015
61 Zhang Y, Wang C, Wang X, Zeng W, Liu W. FairMOT: on the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 2021, 129(11): 3069–3087
62 Feichtenhofer C. X3D: expanding architectures for efficient video recognition. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020
63 Feichtenhofer C, Fan H, Malik J, He K. SlowFast networks for video recognition. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019
64 Yan R, Xie L, Tang J, Shu X, Tian Q. HiGCIN: hierarchical graph-based cross inference network for group activity recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(6): 6955–6968
65 Yuan H, Ni D, Wang M. Spatio-temporal dynamic inference network for group activity recognition. In: Proceedings of 2021 IEEE/CVF International Conference on Computer Vision (ICCV). 2021
66 Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L. Temporal segment networks: towards good practices for deep action recognition. In: Proceedings of the 14th European Conference on Computer Vision. 2016
67 Han R, Gan Y, Li J, Wang F, Feng W, Wang S. Connecting the complementary-view videos: joint camera identification and subject association. In: Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022
68 Han R, Gan Y, Wang L, Li N, Feng W, Wang S. Relating view directions of complementary-view mobile cameras via the human shadow. International Journal of Computer Vision, 2023, 131(5): 1106–1121
Supplementary material: FCS-22418-OF-JL_suppl_1