Frontiers of Computer Science

Front. Comput. Sci.    2025, Vol. 19 Issue (7) : 197329    https://doi.org/10.1007/s11704-024-40387-w
Artificial Intelligence
Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues
Hao WANG, Bin GUO(), Mengqi CHEN, Qiuyun ZHANG, Yasan DING, Ying ZHANG, Zhiwen YU
School of Computer Science, Northwestern Polytechnical University, Xi’an 710129, China
Abstract

The Video-Grounded Dialogue System (VGDS), which generates reasonable responses based on multi-turn dialogue contexts and a given video, has received intensive attention recently. The key to building a superior VGDS lies in efficiently reasoning over visual and textual concepts of various granularities and achieving comprehensive visual-textual multi-modality alignment. Despite remarkable research progress, existing studies struggle to identify the context-relevant parts of a video and disregard the impact of redundant information in long-form, content-dynamic videos. Further, current methods usually align all semantics across modalities uniformly with a one-time cross-attention scheme, which neglects the sophisticated correspondence between visual and textual concepts of various granularities (e.g., still objects with nouns, dynamic events with verbs). To this end, we propose a novel system, the Cascade cOntext-oriented Spatio-Temporal Attention Network (COSTA), to generate reasonable responses efficiently and accurately. Specifically, COSTA first adopts a cascade attention network to localize only the most relevant video clips and regions in a coarse-to-fine manner, which effectively filters out irrelevant visual semantics. Second, we design a memory distillation-inspired iterative visual-textual cross-attention strategy that progressively integrates visual semantics with dialogue contexts across varying granularities, facilitating extensive multi-modal alignment. Experiments on several benchmarks demonstrate that our model significantly outperforms state-of-the-art methods across various metrics.
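For concreteness, the coarse-to-fine selection described above can be sketched in a few lines of PyTorch (a minimal illustration under assumed tensor shapes; names such as cascade_filter, top_c, and top_r are ours, not the authors' implementation):

import torch
import torch.nn.functional as F

def cascade_filter(clip_feats, region_feats, ctx, top_c=4, top_r=8):
    """Coarse-to-fine selection sketch.
    clip_feats:   (L, d)    clip-level video features
    region_feats: (L, R, d) region-level features for each clip
    ctx:          (d,)      pooled dialogue-context embedding
    """
    d = clip_feats.size(-1)
    # Coarse step: temporal filtering, keep only the clips most relevant to the context.
    clip_scores = F.softmax(clip_feats @ ctx / d ** 0.5, dim=0)              # (L,)
    keep_c = clip_scores.topk(min(top_c, clip_feats.size(0))).indices        # (top_c,)
    # Fine step: spatial filtering, keep the most relevant regions inside the kept clips.
    kept_regions = region_feats[keep_c]                                      # (top_c, R, d)
    region_scores = F.softmax(kept_regions @ ctx / d ** 0.5, dim=-1)         # (top_c, R)
    keep_r = region_scores.topk(min(top_r, kept_regions.size(1)), dim=-1).indices
    sel = torch.gather(kept_regions, 1,
                       keep_r.unsqueeze(-1).expand(-1, -1, d))               # (top_c, top_r, d)
    return clip_feats[keep_c], sel

# Example: 16 clips, 10 regions per clip, 512-d features.
clips, regions = cascade_filter(torch.randn(16, 512), torch.randn(16, 10, 512), torch.randn(512))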

Keywords: video-grounded dialogue; spatio-temporal attention; multi-modality
Corresponding Author(s): Bin GUO   
Just Accepted Date: 09 July 2024   Issue Date: 23 September 2024
 Cite this article:   
Hao WANG, Bin GUO, Mengqi CHEN, et al. Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues[J]. Front. Comput. Sci., 2025, 19(7): 197329.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-024-40387-w
https://academic.hep.com.cn/fcs/EN/Y2025/V19/I7/197329
Fig.1  Illustration of the system definition and two problems in VGDS: 1) Multi-aspect reasoning: reasonable responses must be jointly reasoned over multiple aspects, e.g., spatial (“where”), temporal (“beginning”), causal (“why”), and multi-event (“opening the box” after “drinking the water”). 2) Multi-modality alignment: textual references (e.g., “his mouth”) need to be accurately connected to the corresponding nouns in the dialogue history (e.g., “a man”) and to visual concepts in the video (e.g., “the man” who appears in multiple frames)
Fig.2  The overall architecture of our proposed COSTA, which is composed of (1) the input encoders, consisting of a video encoder and a text encoder, (2) a cascade spatio-temporal attention network with two cascade steps (coarse-grained temporal filtering and fine-grained spatial filtering), (3) a memory distillation-inspired iterative visual-textual cross-attention module, and (4) a cross-modal response decoder
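The iterative cross-attention module (3) can likewise be sketched with standard attention primitives; the GRU-style memory update below is an assumption standing in for the paper's memory distillation-inspired rule, which is not spelled out on this page:

import torch
import torch.nn as nn

class IterativeCrossAttention(nn.Module):
    """Sketch: repeatedly attend from a dialogue-context memory to visual tokens,
    folding each round's output back into the memory that drives the next round."""
    def __init__(self, d_model=512, n_heads=8, n_iters=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.update = nn.GRUCell(d_model, d_model)   # assumed gated memory update
        self.n_iters = n_iters

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, Nt, d), visual_tokens: (B, Nv, d)
        memory = text_tokens.mean(dim=1)                          # (B, d) initial summary query
        for _ in range(self.n_iters):
            attended, _ = self.attn(memory.unsqueeze(1), visual_tokens, visual_tokens)
            memory = self.update(attended.squeeze(1), memory)     # distill this round into memory
        return memory                                             # (B, d) fused representation

fused = IterativeCrossAttention()(torch.randn(2, 24, 512), torch.randn(2, 48, 512))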
Models Parameter Count GPU Memory RAM FLOPs (G)
DialogMCF 0.12 B 3.7 GB 3.1 GB 166.83
THAM 0.22 B 7.0 GB 5.7 GB 198.29
COSTA 0.10 B 3.13 GB 2.3 GB 89.37
Video LLaMA 7 B 22.1 GB 7.5 GB 1768.66
LLaMA Adapter 7 B 22.4 GB 6.9 GB 1712.05
Video Chat 7 B 23.6 GB 8.1 GB 1844.58
Video-ChatGPT 7 B 21.8 GB 7.4 GB 1735.24
Tab.1  The computational complexity and resource consumption of models. B represents one billion parameters. FLOPs (G) represents giga floating point operations
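Numbers of this kind are usually obtained with a short measurement script; the sketch below (not the authors' protocol) counts parameters directly, reads peak GPU memory from the CUDA allocator, and, if the optional fvcore package is installed, estimates FLOPs:

import torch

def report_cost(model, example_inputs):
    # Parameter count, as in the "Parameter Count" column (B = billions).
    n_params = sum(p.numel() for p in model.parameters())
    print(f"params: {n_params / 1e9:.2f} B")
    # Peak GPU memory for a single forward pass.
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(*example_inputs)
    print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GB")
    # FLOPs, e.g., via fvcore (optional third-party dependency).
    try:
        from fvcore.nn import FlopCountAnalysis
        print(f"FLOPs: {FlopCountAnalysis(model, example_inputs).total() / 1e9:.2f} G")
    except ImportError:
        pass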
Models BLEU1 BLEU2 BLEU3 BLEU4 METEOR ROUGE-L CIDEr BERTScore
AVSD@DSTC7
SCGA 0.745 0.622 0.517 0.430 0.285 0.578 1.201 0.515
VGNMN – – – 0.429 0.278 0.578 1.188 0.530
PDC 0.747 0.616 0.512 0.429 0.282 0.579 1.194 0.539
BiST 0.755 0.619 0.510 0.429 0.284 0.581 1.192 0.552
PDC-GPT2 0.770 0.653 0.539 0.449 0.292 0.606 1.295 0.579
RLM 0.765 0.643 0.543 0.459 0.294 0.606 1.308 0.583
CRMSG 0.776 0.652 0.551 0.466 0.304 0.609 1.333 0.588
DialogMCF 0.777 0.653 0.547 0.457 0.306 0.613 1.352 0.592
THAM 0.778 0.654 0.549 0.468 0.308 0.619 1.335 0.595
COSTA 0.793 0.657 0.564 0.490 0.315 0.642 1.388 0.641
COSTA-BLIP 0.819 0.673 0.580 0.508 0.325 0.649 1.410 0.663
COSTA-Frozen 0.804 0.664 0.588 0.506 0.318 0.661 1.442 0.651
Video LLaMA 0.780 0.642 0.558 0.483 0.308 0.633 1.373 0.625
w/ COSTA 0.805 0.669 0.570 0.504 0.317 0.660 1.419 0.652
LLaMA Adapter 0.790 0.656 0.562 0.488 0.316 0.641 1.384 0.641
w/ COSTA 0.817 0.674 0.586 0.512 0.327 0.668 1.440 0.666
Video Chat 0.795 0.661 0.565 0.493 0.317 0.644 1.394 0.647
w/ COSTA 0.827 0.680 0.597 0.523 0.327 0.670 1.445 0.686
Video-ChatGPT 0.798 0.667 0.570 0.502 0.322 0.655 1.427 0.661
w/ COSTA 0.840 0.696 0.608 0.541 0.332 0.678 1.553 0.719
AVSD@DSTC8
SCGA 0.711 0.593 0.497 0.416 0.276 0.566 1.123 0.554
PDC 0.723 0.595 0.493 0.410 0.270 0.570 1.105 0.562
PDC-GPT2 0.749 0.629 0.528 0.439 0.285 0.592 1.201 0.589
RLM 0.746 0.626 0.528 0.445 0.286 0.598 1.240 0.587
DialogMCF 0.756 0.633 0.532 0.449 0.293 0.601 1.253 0.595
THAM 0.764 0.641 0.538 0.455 0.301 0.610 1.304 0.602
COSTA 0.776 0.654 0.550 0.473 0.307 0.631 1.353 0.633
COSTA-BLIP 0.799 0.663 0.577 0.490 0.318 0.643 1.385 0.658
COSTA-Frozen 0.800 0.659 0.568 0.484 0.311 0.652 1.425 0.645
Video LLaMA 0.761 0.642 0.545 0.469 0.303 0.620 1.345 0.619
w/ COSTA 0.796 0.660 0.566 0.489 0.309 0.651 1.394 0.647
LLaMA Adapter 0.774 0.651 0.544 0.469 0.307 0.632 1.349 0.628
w/ COSTA 0.806 0.665 0.573 0.495 0.314 0.653 1.428 0.659
Video Chat 0.780 0.660 0.551 0.485 0.309 0.637 1.355 0.643
w/ COSTA 0.816 0.668 0.594 0.502 0.325 0.667 1.433 0.674
Video-ChatGPT 0.783 0.665 0.557 0.495 0.315 0.649 1.384 0.651
w/ COSTA 0.837 0.689 0.600 0.532 0.330 0.657 1.478 0.707
Tab.2  Evaluation results on the test sets of AVSD@DSTC7 and AVSD@DSTC8
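The n-gram metrics in Tab.2 can be reproduced with standard toolkits; a minimal NLTK example for BLEU-1 to BLEU-4 on a toy sentence pair is shown below (the official AVSD evaluation uses corpus-level scripts, so exact values differ):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the man opens the box after drinking the water".split()
candidate = "the man opens a box after he drinks water".split()
smooth = SmoothingFunction().method1
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # uniform n-gram weights -> BLEU-n
    score = sentence_bleu([reference], candidate, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")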
Accuracy (%) HRNN VDTN MTN COSTA
Action count 36.00 38.78 38.80 45.22
Action query 38.60 39.37 39.40 45.09
Attribute query 45.10 42.93 43.10 53.29
Compare action seq 57.50 61.57 61.60 69.35
Compare action set 44.30 45.41 45.40 58.17
Compare action freq 65.20 66.42 67.10 68.74
None 45.10 43.51 43.40 53.17
Atomic (non-spatial) 50.70 48.88 48.90 58.51
Atomic (spatial) 47.60 47.12 47.10 51.66
Compositional 51.40 53.18 53.20 57.82
Transfer (attribute) 57.30 57.70 57.70 63.34
Transfer (spatial) 47.40 47.86 48.00 52.01
Transfer (temporal) 64.60 68.72 69.00 76.24
All 50.20 51.02 51.10 58.33
Tab.3  Evaluation results on the test set of DVD
Models Language Fluency Context Coherence Factual Correctness Kappa
AVSD@DSTC7
RLM 1.62 1.61 1.35 0.65
CRMSG 1.75 1.68 1.52 0.68
THAM 1.78 1.70 1.56 0.69
COSTA 1.83 1.78 1.71 0.74
DVD
HRNN 1.54 1.59 1.28 0.61
VDTN 1.68 1.65 1.45 0.64
MTN 1.72 1.66 1.52 0.67
COSTA 1.75 1.73 1.66 0.70
Tab.4  Human evaluation results on the test sets of AVSD@DSTC7 and DVD
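The Kappa column reports inter-annotator agreement (Fleiss' kappa); a self-contained sketch of the computation on toy ratings, not the paper's data:

import numpy as np

def fleiss_kappa(counts):
    """counts[i, j]: number of raters who assigned item i to category j (same raters per item)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                                   # raters per item
    p_j = counts.sum(axis=0) / counts.sum()                     # category proportions
    p_i = (counts * (counts - 1)).sum(axis=1) / (n * (n - 1))   # per-item agreement
    p_bar, p_e = p_i.mean(), (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 items rated by 3 annotators into 3 score categories.
print(round(fleiss_kappa([[3, 0, 0], [2, 1, 0], [0, 3, 0], [0, 1, 2]]), 3))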
Evaluation aspect BLIP-2 w/ COSTA Video Chat w/ COSTA LLaMA Adapter w/ COSTA Video LLaMA w/ COSTA Video-ChatGPT w/ COSTA GPT-4V
Correctness of Information 1.94 2.00 2.23 2.30 2.03 2.06 1.96 2.01 2.40 2.54 2.50
Detail Orientation 2.09 2.18 2.50 2.61 2.32 2.46 2.18 2.30 2.52 2.68 2.65
Contextual Understanding 2.11 2.17 2.53 2.71 2.30 2.37 2.16 2.18 2.62 2.73 2.69
Temporal Understanding 1.83 1.87 1.94 2.04 1.98 2.10 1.82 1.90 1.98 2.11 2.03
Consistency 1.85 1.89 2.24 2.28 2.15 2.20 1.79 1.85 2.37 2.44 2.47
Tab.5  Performance of LLM-based models and their COSTA-enhanced variants on the video-based text generation performance benchmark
AVSD@DSTC7
Models BLEU4 METEOR ROUGE-L CIDEr BERTScore
COSTA 0.490 0.315 0.642 1.388 0.641
- TF 0.468 0.308 0.615 1.329 0.609
- SF 0.489 0.310 0.639 1.386 0.629
- TF - SF 0.455 0.299 0.610 1.305 0.600
- ICA 0.469 0.305 0.616 1.324 0.627
AVSD@DSTC8
COSTA 0.473 0.307 0.631 1.353 0.633
- TF 0.452 0.295 0.595 1.259 0.601
- SF 0.467 0.302 0.622 1.332 0.626
- TF - SF 0.440 0.286 0.580 1.234 0.597
- ICA 0.449 0.295 0.598 1.283 0.614
Tab.6  Ablation analysis on two AVSD benchmarks with model variants of COSTA
Fig.3  Performance of COSTA under different configuration settings on the test sets of AVSD@DSTC7 and DVD. (a) The CIDEr score of COSTA under different settings of Topc on the test set of AVSD@DSTC7; (b) the CIDEr score under different settings of Topr on AVSD@DSTC7; (c) the CIDEr score under different settings of the number of iterations T on AVSD@DSTC7; (d) the accuracy under different settings of the number of iterations T on DVD; (e) the CIDEr score under different settings of the number of video clips L on AVSD@DSTC7; (f) the CIDEr score under different settings of the number of sampled image frames M on AVSD@DSTC7
Models Parameters AVSD (RTX 3090) AVSD (Tesla A100) VGPB (RTX 3090) VGPB (Tesla A100)
DialogMCF 0.12 B 5.67 1.92 7.05 2.58
THAM 0.22 B 3.83 1.18 4.16 1.42
COSTA 0.10 B 2.04 0.55 2.82 0.79
Video LLaMA 7 B – 8.66 – 10.01
w/ COSTA 7 B – 8.35 – 9.48
LLaMA Adapter 7 B – 9.01 – 9.84
w/ COSTA 7 B – 8.84 – 9.27
Video Chat 7 B – 7.89 – 8.62
w/ COSTA 7 B – 7.62 – 8.25
Video-ChatGPT 7 B – 7.55 – 8.58
w/ COSTA 7 B – 7.24 – 7.93
Tab.7  The response times (in seconds) of COSTA and baseline models on two benchmarks and two devices. VGPB represents the Video-based Text Generation Performance Benchmarking. B represents one billion parameters
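Latency figures like those in Tab.7 are typically obtained by timing synchronized forward passes after a warm-up phase; a rough sketch, with the caveat that the authors' exact protocol may differ:

import time
import torch

@torch.no_grad()
def avg_response_time(model, batches, warmup=3, runs=20):
    for inputs in batches[:warmup]:       # warm-up passes to exclude one-off CUDA setup costs
        model(*inputs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for inputs in batches[:runs]:
        model(*inputs)
    torch.cuda.synchronize()              # wait for all GPU kernels before reading the clock
    return (time.perf_counter() - start) / min(runs, len(batches))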
Fig.4  Qualitative examples from the test set of AVSD@DSTC7. A3 (COSTA) is the response generated by COSTA from the input video and dialogue context in each sample. We visualize the iterative steps: image frames marked with green and orange boxes correspond to the successive cross-attention steps, respectively, and brighter regions within each frame indicate higher attention scores
Fig.5  Qualitative examples from the test set of AVSD@DSTC7
Fig.6  Qualitative examples from the test set of AVSD@DSTC7