1. School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
2. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Beijing 100876, China
3. Shanghai International School of Chief Technology Officer, East China Normal University, Shanghai 200062, China
4. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
5. School of Cyberspace Security, Guangzhou University, Guangdong 510006, China
Learning modality-fused representations and processing unaligned multimodal sequences are both meaningful and challenging in multimodal emotion recognition. Existing approaches fuse the language, visual, and audio modalities through directional pairwise attention or a message hub, but such fusion is typically quadratic in the modal sequence length, introduces redundant information, and is therefore inefficient. In this paper, we propose an efficient neural network that learns modality-fused representations with a CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first extract features from each of the three modalities to capture the local structure of the sequences. We then design a novel asymmetric transformer with cross-modal blocks (CB-Transformer) that enables complementary learning across modalities and consists mainly of local temporal learning, cross-modal feature fusion, and global self-attention representation. In addition, we concatenate the fused features with the original features to classify the emotions of the sequences. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results demonstrate the superiority and efficiency of our proposed method in both settings; compared with mainstream methods, our approach achieves state-of-the-art performance with the fewest parameters.
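To make the described pipeline concrete, the following is a minimal PyTorch sketch of the three-stage design summarized above: per-modality feature extraction, cross-modal fusion, global self-attention, and concatenation of the fused features with the original features before classification. All module choices, feature dimensions, and the exact wiring of the cross-modal blocks are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of the pipeline described in the abstract, written in PyTorch.
# Module names, dimensions, and the fusion wiring are illustrative assumptions.
import torch
import torch.nn as nn


class CBTransformerSketch(nn.Module):
    """Hypothetical three-modality fusion model: per-modality temporal
    convolutions, cross-modal attention blocks, global self-attention,
    then concatenation of fused and original (pooled) features."""

    def __init__(self, d_l=300, d_a=74, d_v=35, d_model=40, n_heads=4, n_classes=4):
        super().__init__()
        # Local feature extraction: 1D convolutions capture the local structure
        # of each (possibly unaligned) sequence and project to a shared width.
        self.proj_l = nn.Conv1d(d_l, d_model, kernel_size=3, padding=1)
        self.proj_a = nn.Conv1d(d_a, d_model, kernel_size=3, padding=1)
        self.proj_v = nn.Conv1d(d_v, d_model, kernel_size=3, padding=1)
        # Cross-modal fusion: audio and visual streams attend to the language
        # stream (an assumed stand-in for the paper's asymmetric cross-modal blocks).
        self.cross_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Global self-attention over the fused sequence.
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.global_attn = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Classifier over fused + original pooled features (4 * d_model in total).
        self.classifier = nn.Linear(4 * d_model, n_classes)

    def forward(self, x_l, x_a, x_v):
        # Inputs: (batch, seq_len_m, feat_dim_m) per modality; lengths may differ.
        h_l = self.proj_l(x_l.transpose(1, 2)).transpose(1, 2)
        h_a = self.proj_a(x_a.transpose(1, 2)).transpose(1, 2)
        h_v = self.proj_v(x_v.transpose(1, 2)).transpose(1, 2)
        # Cross-modal feature fusion: audio/visual queries, language keys/values.
        f_a, _ = self.cross_a(h_a, h_l, h_l)
        f_v, _ = self.cross_v(h_v, h_l, h_l)
        # Global self-attention representation of the fused sequence.
        fused = self.global_attn(torch.cat([f_a, f_v], dim=1))
        # Concatenate pooled fused features with pooled original features.
        pooled = torch.cat(
            [fused.mean(1), h_l.mean(1), h_a.mean(1), h_v.mean(1)], dim=-1)
        return self.classifier(pooled)


# Example with unaligned sequence lengths per modality (dimensions are hypothetical).
model = CBTransformerSketch()
logits = model(torch.randn(2, 50, 300), torch.randn(2, 375, 74), torch.randn(2, 500, 35))
print(logits.shape)  # torch.Size([2, 4])
```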
1. Nguyen D, Nguyen K, Sridharan S, Dean D, Fookes C. Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition. Computer Vision and Image Understanding, 2018, 174: 33–42
2. Poria S, Hazarika D, Majumder N, Mihalcea R. Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research. IEEE Transactions on Affective Computing, 2023, 14(1): 108–132
3. Dai W, Cahyawijaya S, Liu Z, Fung P. Multimodal end-to-end sparse model for emotion recognition. In: Proceedings of 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021
4. Chandra R, Krishna A. Covid-19 sentiment analysis via deep learning during the rise of novel cases. PLoS One, 2021, 16(8): e0255615
5. Tsai Y H H, Liang P P, Zadeh A, Morency L P, Salakhutdinov R. Learning factorized multimodal representations. In: Proceedings of the 7th International Conference on Learning Representations. 2019
6. Pham H, Liang P P, Manzini T, Morency L P, Póczos B. Found in translation: Learning robust joint representations by cyclic translations between modalities. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence and 31st Innovative Applications of Artificial Intelligence Conference and 9th AAAI Symposium on Educational Advances in Artificial Intelligence. 2019
7. Sahay S, Okur E, Kumar S H, Nachman L. Low rank fusion based transformers for multimodal sequences. In: Proceedings of the 2nd Grand-Challenge and Workshop on Multimodal Language (Challenge-HML). 2020
8. Rahman W, Hasan M K, Lee S, Zadeh A B, Mao C, Morency L P, Hoque E. Integrating multimodal information in large pretrained transformers. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020
9. Hazarika D, Zimmermann R, Poria S. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In: Proceedings of the 28th ACM International Conference on Multimedia. 2020
10. Yu W, Xu H, Yuan Z, Wu J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021
11. Dai W, Cahyawijaya S, Bang Y, Fung P. Weakly-supervised multi-task learning for multimodal affect recognition. 2021, arXiv preprint arXiv: 2104.11560
12. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6000−6010
13. Tsai Y H H, Bai S, Liang P P, Kolter J Z, Morency L P, Salakhutdinov R. Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019
14. Lv F, Chen X, Huang Y, Duan L, Lin G. Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences. In: Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021, 2554−2562
15. Busso C, Bulut M, Lee C C, Kazemzadeh A, Mower E, Kim S, Chang J, Lee S, Narayanan S S. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335–359
16. Zadeh A, Zellers R, Pincus E, Morency L P. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems, 2016, 31(6): 82–88
17. Zadeh A B, Liang P P, Poria S, Cambria E, Morency L P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, 2236−2246
18. Morency L P, Mihalcea R, Doshi P. Towards multimodal sentiment analysis: Harvesting opinions from the web. In: Proceedings of the 13th International Conference on Multimodal Interfaces. 2011, 169−176
19. Pérez-Rosas V, Mihalcea R, Morency L P. Utterance-level multimodal sentiment analysis. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013
20. Zadeh A, Zellers R, Pincus E, Morency L P. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. 2016, arXiv preprint arXiv: 1606.06259
21. Wang H, Meghawat A, Morency L P, Xing E P. Select-additive learning: Improving generalization in multimodal sentiment analysis. In: Proceedings of 2017 IEEE International Conference on Multimedia and Expo (ICME). 2017, 949−954
22. Wang Y, Shen Y, Liu Z, Liang P P, Zadeh A, Morency L P. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 7216–7223
23. Zeng Z, Tu J, Pianfetti B, Liu M, Zhang T, Zhang Z, Huang T S, Levinson S. Audio-visual affect recognition through multi-stream fused HMM for HCI. In: Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). 2005, 967−972
24. Dai W, Liu Z, Yu T, Fung P. Modality-transferable emotion embeddings for low-resource multimodal emotion recognition. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. 2020
25. Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation. In: Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014, 1532−1543
26. Baltrusaitis T, Robinson P, Morency L P. OpenFace: An open source facial behavior analysis toolkit. In: Proceedings of 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). 2016, 1−10
27. Degottex G, Kane J, Drugman T, Raitio T, Scherer S. COVAREP — a collaborative voice analysis repository for speech technologies. In: Proceedings of 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2014, 960−964
28. Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning. 2006, 369−376