Frontiers of Computer Science  2024, Vol. 18 Issue (4): 184314   https://doi.org/10.1007/s11704-023-2444-y
LMR-CBT: learning modality-fused representations with CB-Transformer for multimodal emotion recognition from unaligned multimodal sequences
Ziwang FU 1,2, Feng LIU 3,4, Qing XU 1,2, Xiangling FU 1,2, Jiayin QI 1,2,5
1. School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
2. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Beijing 100876, China
3. Shanghai International School of Chief Technology Officer, East China Normal University, Shanghai 200062, China
4. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
5. School of Cyberspace Security, Guangzhou University, Guangdong 510006, China
Abstract

Learning modality-fused representations and processing unaligned multimodal sequences are meaningful and challenging problems in multimodal emotion recognition. Existing approaches use directional pairwise attention or a message hub to fuse the language, visual, and audio modalities. However, these fusion methods often scale quadratically with the length of the modal sequences, introduce redundant information, and are therefore inefficient. In this paper, we propose an efficient neural network that learns modality-fused representations with a CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first perform feature extraction on each of the three modalities to capture the local structure of the sequences. We then design a novel asymmetric transformer with cross-modal blocks (CB-Transformer) that enables complementary learning across modalities; it consists of local temporal learning, cross-modal feature fusion, and global self-attention representations. In addition, we concatenate the fused features with the original features to classify the emotion of each sequence. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results demonstrate the superiority and efficiency of our method in both settings; compared with mainstream methods, our approach reaches state-of-the-art performance with the smallest number of parameters.

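To make the pipeline described in the abstract concrete, the sketch below shows one way its stages could fit together in PyTorch: per-modality convolutional feature extraction, local temporal learning, cross-modal fusion into a target modality, global self-attention, and classification over the concatenation of fused and original features. All module names (ConvFrontend, CrossModalBlock, LMRCBTSketch), dimensions, kernel choices, and the [V, A] -> L wiring are illustrative assumptions, not the authors' released implementation; the fusion direction merely follows the best-performing row of Tab.6.

```python
# Illustrative sketch of the LMR-CBT pipeline described in the abstract.
# NOT the authors' implementation; layer choices, sizes, and the exact
# cross-modal wiring are assumptions made for readability.
import torch
import torch.nn as nn


class ConvFrontend(nn.Module):
    """Per-modality 1D convolution capturing the local structure of a sequence."""
    def __init__(self, in_dim, d_model, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, d_model, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: (batch, time, in_dim)
        x = self.conv(x.transpose(1, 2))       # -> (batch, d_model, time)
        return x.transpose(1, 2)               # -> (batch, time, d_model)


class CrossModalBlock(nn.Module):
    """Fuses two auxiliary modalities into a target modality via cross-attention."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, target, context):        # queries come from the target modality
        fused, _ = self.attn(target, context, context)
        return self.norm(target + fused)


class LMRCBTSketch(nn.Module):
    def __init__(self, dims, d_model=40, n_heads=8, n_classes=4):
        super().__init__()
        self.frontends = nn.ModuleList(
            [ConvFrontend(d, d_model, k) for d, k in zip(dims, (1, 3, 3))]
        )
        self.local = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=1
        )
        self.cross = CrossModalBlock(d_model, n_heads)
        self.global_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=1
        )
        self.classifier = nn.Linear(2 * d_model, n_classes)

    def forward(self, lang, vis, aud):
        # 1) per-modality feature extraction (local structure)
        l, v, a = (f(x) for f, x in zip(self.frontends, (lang, vis, aud)))
        # 2) local temporal learning on the language stream
        l = self.local(l)
        # 3) cross-modal fusion: visual and audio attend into language ([V, A] -> L)
        fused = self.cross(l, torch.cat([v, a], dim=1))
        # 4) global self-attention over the fused representation
        fused = self.global_attn(fused)
        # 5) concatenate fused and original features, then classify
        rep = torch.cat([fused.mean(dim=1), l.mean(dim=1)], dim=-1)
        return self.classifier(rep)


if __name__ == "__main__":
    model = LMRCBTSketch(dims=(300, 35, 74))   # toy language / visual / audio feature sizes
    lang, vis, aud = (torch.randn(2, t, d) for t, d in ((50, 300), (60, 35), (60, 74)))
    print(model(lang, vis, aud).shape)          # torch.Size([2, 4])
```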
Key words: modality-fused representations; cross-modal blocks; multimodal emotion recognition; unaligned multimodal sequences; computational affection
Received: 2022-07-11      Published: 2023-05-22
Corresponding Author(s): Xiangling FU, Jiayin QI
Cite this article:
Ziwang FU, Feng LIU, Qing XU, Xiangling FU, Jiayin QI. LMR-CBT: learning modality-fused representations with CB-Transformer for multimodal emotion recognition from unaligned multimodal sequences. Front. Comput. Sci., 2024, 18(4): 184314.
Link to this article:
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-023-2444-y
https://academic.hep.com.cn/fcs/CN/Y2024/V18/I4/184314
Fig.1
Fig.2
Setting              | CMU-MOSEI | CMU-MOSI | IEMOCAP
Optimizer            | Adam      | Adam     | Adam
Batch size           | 32        | 8        | 32
Learning rate        | 1e-3      | 2e-3     | 1e-3
Epochs               | 120       | 100      | 60
Feature size d       | 40        | 30       | 40
Attention heads h    | 8         | 10       | 5
Kernel size (V/A)    | 3/3       | 3/1      | 3/5
Transformer layers D | 5         | 4        | 5
Tab.1  Training hyperparameters for the three datasets
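For convenience, the Tab.1 settings can be collected into a plain configuration mapping, as in the sketch below. Only the values come from the table; the dictionary layout and the make_optimizer helper are hypothetical.

```python
# Hyperparameters from Tab.1, gathered per dataset. Only the values are from the
# table; the dictionary keys and helper function are illustrative assumptions.
import torch

HPARAMS = {
    "CMU-MOSEI": dict(batch_size=32, lr=1e-3, epochs=120, d=40, heads=8,  kernel_va=(3, 3), layers=5),
    "CMU-MOSI":  dict(batch_size=8,  lr=2e-3, epochs=100, d=30, heads=10, kernel_va=(3, 1), layers=4),
    "IEMOCAP":   dict(batch_size=32, lr=1e-3, epochs=60,  d=40, heads=5,  kernel_va=(3, 5), layers=5),
}

def make_optimizer(model: torch.nn.Module, dataset: str) -> torch.optim.Optimizer:
    """Adam optimizer with the learning rate listed in Tab.1 for the given dataset."""
    return torch.optim.Adam(model.parameters(), lr=HPARAMS[dataset]["lr"])
```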
Setting   | Method           | Happy Acc / F1 (%) | Sad Acc / F1 (%) | Angry Acc / F1 (%) | Neutral Acc / F1 (%)
Aligned   | EF-LSTM          | 86.0 / 84.2        | 80.2 / 80.5      | 85.2 / 84.5        | 67.8 / 67.1
Aligned   | LF-LSTM          | 85.1 / 86.3        | 78.9 / 81.7      | 84.7 / 83.0        | 67.1 / 67.6
Aligned   | MFM              | 90.2 / 85.8        | 88.4 / 86.1      | 87.5 / 86.7        | 72.1 / 68.1
Aligned   | RAVEN            | 87.3 / 85.8        | 83.4 / 83.1      | 87.3 / 86.7        | 69.7 / 69.3
Aligned   | MCTN             | 84.9 / 83.1        | 80.5 / 79.6      | 79.7 / 80.4        | 62.3 / 57.0
Aligned   | MulT*            | 86.4 / 82.9        | 82.3 / 82.4      | 85.3 / 85.8        | 71.2 / 70.0
Aligned   | LMF-MulT         | 85.3 / 84.1        | 84.1 / 83.4      | 85.7 / 86.2        | 71.2 / 70.8
Aligned   | PMR†             | 91.3 / 89.2        | 87.8 / 87.0      | 88.1 / 87.5        | 73.0 / 71.5
Aligned   | LMR-CBT (ours)   | 87.9 / 84.6        | 85.3 / 84.4      | 86.2 / 86.3        | 71.5 / 70.6
Unaligned | EF-LSTM          | 76.2 / 75.7        | 70.2 / 70.5      | 72.7 / 67.1        | 58.1 / 57.4
Unaligned | LF-LSTM          | 72.5 / 71.8        | 72.9 / 70.4      | 68.6 / 67.9        | 59.6 / 56.2
Unaligned | RAVEN            | 77.0 / 76.8        | 67.6 / 65.6      | 65.0 / 64.1        | 62.0 / 59.5
Unaligned | MCTN             | 80.5 / 77.5        | 72.0 / 71.7      | 64.9 / 65.6        | 49.4 / 49.3
Unaligned | MulT (1.07M)*    | 85.6 / 79.0        | 79.4 / 70.3      | 75.8 / 65.4        | 59.5 / 44.7
Unaligned | LMF-MulT (0.86M) | 85.6 / 79.0        | 79.4 / 70.3      | 75.8 / 65.4        | 59.2 / 44.0
Unaligned | PMR (2.15M)†     | 86.4 / 83.3        | 78.5 / 75.3      | 75.0 / 71.3        | 63.7 / 60.9
Unaligned | LMR-CBT (0.34M)  | 85.7 / 79.5        | 79.4 / 72.6      | 76.0 / 70.7        | 63.6 / 60.5
Tab.2  Results on the IEMOCAP dataset (accuracy and F1 per emotion class)
Setting   | Method           | Acc7/% | Acc2/% | F1/%
Aligned   | EF-LSTM          | 33.7   | 75.3   | 75.2
Aligned   | LF-LSTM          | 35.3   | 76.8   | 76.7
Aligned   | MFM              | 36.2   | 78.1   | 78.1
Aligned   | RAVEN            | 33.2   | 78.0   | 76.6
Aligned   | MCTN             | 35.6   | 79.3   | 79.1
Aligned   | MulT*            | 33.1   | 78.5   | 78.4
Aligned   | LMF-MulT         | 32.4   | 77.9   | 77.9
Aligned   | PMR†             | 40.6   | 83.6   | 83.4
Aligned   | LMR-CBT (ours)   | 39.2   | 81.6   | 79.8
Unaligned | EF-LSTM          | 31.0   | 73.6   | 74.5
Unaligned | LF-LSTM          | 33.7   | 77.6   | 77.8
Unaligned | RAVEN            | 31.7   | 72.7   | 73.1
Unaligned | MCTN             | 32.7   | 75.9   | 76.4
Unaligned | MulT (1.07M)*    | 34.3   | 80.3   | 80.4
Unaligned | LMF-MulT (0.84M) | 34.0   | 78.5   | 78.5
Unaligned | MISA (15.9M)†    | 41.4   | 81.8   | 81.8
Unaligned | PMR (2.14M)†     | 40.6   | 82.4   | 82.1
Unaligned | LMR-CBT (0.35M)  | 39.5   | 81.2   | 81.0
Unaligned | LMR-CBT (1.05M)  | 41.4   | 83.1   | 83.1
Tab.3  Results on the CMU-MOSI dataset
Setting   | Method           | Acc7/% | Acc2/% | F1/%
Aligned   | EF-LSTM          | 47.4   | 78.2   | 77.9
Aligned   | LF-LSTM          | 48.8   | 80.6   | 80.6
Aligned   | G-MFN            | 45.0   | 76.9   | 77.0
Aligned   | RAVEN            | 50.0   | 79.1   | 79.5
Aligned   | MCTN             | 49.6   | 79.8   | 80.6
Aligned   | MulT*            | 49.3   | 80.5   | 81.1
Aligned   | LMF-MulT         | 50.2   | 80.3   | 80.3
Aligned   | PMR†             | 52.5   | 83.3   | 82.6
Aligned   | LMR-CBT (ours)   | 50.7   | 80.5   | 80.9
Unaligned | EF-LSTM          | 46.3   | 76.1   | 75.9
Unaligned | LF-LSTM          | 48.8   | 77.5   | 78.2
Unaligned | RAVEN            | 45.5   | 75.4   | 75.7
Unaligned | MCTN             | 48.2   | 79.3   | 79.7
Unaligned | MulT (1.07M)*    | 50.4   | 80.7   | 80.6
Unaligned | LMF-MulT (0.86M) | 49.3   | 80.8   | 81.3
Unaligned | MISA (15.9M)†    | 52.1   | 80.7   | 81.1
Unaligned | PMR (2.15M)†     | 51.8   | 83.1   | 82.8
Unaligned | LMR-CBT (0.41M)  | 51.8   | 80.9   | 81.5
Unaligned | LMR-CBT (1.23M)  | 51.9   | 82.7   | 82.8
Tab.4  Results on the CMU-MOSEI dataset
Setting   | Method         | Params(M) | FLOPs
Aligned   | EF-LSTM        | 0.57      | 6.55
Aligned   | LF-LSTM        | 1.23      | 14.15
Aligned   | G-MFN          | 1.06      | 12.72
Aligned   | RAVEN          | 1.16      | 13.11
Aligned   | MCTN           | 0.48      | 5.57
Aligned   | MulT*          | 1.04      | 12.38
Aligned   | LMF-MulT       | 0.82      | 9.68
Aligned   | PMR†           | 2.11      | 25.32
Aligned   | LMR-CBT (ours) | 0.37      | 4.26
Unaligned | EF-LSTM        | 0.61      | 7.02
Unaligned | LF-LSTM        | 1.27      | 14.73
Unaligned | RAVEN          | 1.20      | 14.40
Unaligned | MCTN           | 0.52      | 5.88
Unaligned | MulT*          | 1.07      | 12.41
Unaligned | LMF-MulT       | 0.86      | 10.32
Unaligned | MISA†          | 15.9      | 174.9
Unaligned | PMR†           | 2.15      | 25.83
Unaligned | LMR-CBT        | 0.41      | 4.72
Tab.5  Comparison of model size (parameters) and computation (FLOPs)
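The Params(M) column of Tab.5 can be checked for any PyTorch module with a simple count of trainable parameters, as in the snippet below; reproducing the FLOPs column would additionally require a profiler (e.g., fvcore or thop) and is not shown. The helper name is hypothetical.

```python
# Count trainable parameters in millions, matching the unit of the Params(M) column.
import torch.nn as nn

def count_params_millions(model: nn.Module) -> float:
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```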
Method      | Params(M) | Acc7/% | Acc2/% | F1/%
Language    | 0.10      | 46.5   | 77.4   | 78.2
Visual      | 0.08      | 43.5   | 66.5   | 68.3
Audio       | 0.06      | 41.4   | 65.4   | 67.7
A-enhance   | 0.19      | 44.8   | 74.7   | 75.8
V-enhance   | 0.17      | 45.7   | 75.4   | 76.1
w/o Conv    | 0.36      | 50.7   | 79.4   | 79.8
w/ FC       | 0.41      | 51.6   | 80.8   | 81.3
Conv1D      | 0.38      | 50.6   | 78.5   | 80.1
Transformer | 1.07      | 51.2   | 79.0   | 81.0
BiLSTM      | 0.41      | 51.8   | 80.9   | 81.5
[V, L] -> A | 0.41      | 50.7   | 79.2   | 80.8
[L, A] -> V | 0.41      | 51.1   | 79.3   | 81.0
[V, A] -> L | 0.41      | 51.8   | 80.9   | 81.5
Tab.6  Ablation study results
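The last three rows of Tab.6 ablate which modality the other two are fused into. Assuming a cross-attention fusion in which the target stream supplies the queries (an assumption, not necessarily the paper's exact block), the three directions differ only in the choice of query stream; the toy snippet below illustrates this, with attn and the tensors l, v, a being hypothetical stand-ins for the language, visual, and audio streams.

```python
# Illustrative only: the three fusion directions ablated in Tab.6.
# The "target" modality supplies the attention queries; the other two are
# concatenated along the time axis and serve as keys/values.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=40, num_heads=5, batch_first=True)
# toy language / visual / audio streams: (batch, time, feature)
l, v, a = (torch.randn(2, t, 40) for t in (50, 60, 60))

def fuse(target, *context):
    ctx = torch.cat(context, dim=1)
    out, _ = attn(target, ctx, ctx)
    return out

fused_l = fuse(l, v, a)   # [V, A] -> L (best row in Tab.6)
fused_v = fuse(v, l, a)   # [L, A] -> V
fused_a = fuse(a, v, l)   # [V, L] -> A
```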
Fig.3
1 Nguyen D, Nguyen K, Sridharan S, Dean D, Fookes C. Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition. Computer Vision and Image Understanding, 2018, 174: 33–42
2 Poria S, Hazarika D, Majumder N, Mihalcea R. Beneath the tip of the iceberg: current challenges and new directions in sentiment analysis research. IEEE Transactions on Affective Computing, 2023, 14(1): 108–132
3 Dai W, Cahyawijaya S, Liu Z, Fung P. Multimodal end-to-end sparse model for emotion recognition. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021
4 Chandra R, Krishna A. COVID-19 sentiment analysis via deep learning during the rise of novel cases. PLoS One, 2021, 16(8): e0255615
5 Tsai Y H H, Liang P P, Zadeh A, Morency L P, Salakhutdinov R. Learning factorized multimodal representations. In: Proceedings of the 7th International Conference on Learning Representations. 2019
6 Pham H, Liang P P, Manzini T, Morency L P, Póczos B. Found in translation: learning robust joint representations by cyclic translations between modalities. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence and 31st Innovative Applications of Artificial Intelligence Conference and 9th AAAI Symposium on Educational Advances in Artificial Intelligence. 2019
7 Sahay S, Okur E, Kumar S H, Nachman L. Low rank fusion based transformers for multimodal sequences. In: Proceedings of the Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML). 2020
8 Rahman W, Hasan M K, Lee S, Zadeh A B, Mao C, Morency L P, Hoque E. Integrating multimodal information in large pretrained transformers. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020
9 Hazarika D, Zimmermann R, Poria S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis. In: Proceedings of the 28th ACM International Conference on Multimedia. 2020
10 Yu W, Xu H, Yuan Z, Wu J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021
11 Dai W, Cahyawijaya S, Bang Y, Fung P. Weakly-supervised multi-task learning for multimodal affect recognition. 2021, arXiv preprint arXiv: 2104.11560
12 Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6000−6010
13 Tsai Y H H, Bai S, Liang P P, Kolter J Z, Morency L P, Salakhutdinov R. Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019
14 Lv F, Chen X, Huang Y, Duan L, Lin G. Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences. In: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021, 2554−2562
15 Busso C, Bulut M, Lee C C, Kazemzadeh A, Mower E, Kim S, Chang J, Lee S, Narayanan S S. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335–359
16 Zadeh A, Zellers R, Pincus E, Morency L P. Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages. IEEE Intelligent Systems, 2016, 31(6): 82–88
17 Zadeh A B, Liang P P, Poria S, Cambria E, Morency L P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, 2236−2246
18 Morency L P, Mihalcea R, Doshi P. Towards multimodal sentiment analysis: harvesting opinions from the web. In: Proceedings of the 13th International Conference on Multimodal Interfaces. 2011, 169−176
19 Pérez-Rosas V, Mihalcea R, Morency L P. Utterance-level multimodal sentiment analysis. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013
20 Zadeh A, Zellers R, Pincus E, Morency L P. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. 2016, arXiv preprint arXiv: 1606.06259
21 Wang H, Meghawat A, Morency L P, Xing E P. Select-additive learning: improving generalization in multimodal sentiment analysis. In: Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME). 2017, 949−954
22 Wang Y, Shen Y, Liu Z, Liang P P, Zadeh A, Morency L P. Words can shift: dynamically adjusting word representations using nonverbal behaviors. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 7216–7223
23 Zeng Z, Tu J, Pianfetti B, Liu M, Zhang T, Zhang Z, Huang T S, Levinson S. Audio-visual affect recognition through multi-stream fused HMM for HCI. In: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). 2005, 967−972
24 Dai W, Liu Z, Yu T, Fung P. Modality-transferable emotion embeddings for low-resource multimodal emotion recognition. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. 2020
25 Pennington J, Socher R, Manning C. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014, 1532−1543
26 Baltrusaitis T, Robinson P, Morency L P. OpenFace: an open source facial behavior analysis toolkit. In: Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). 2016, 1−10
27 Degottex G, Kane J, Drugman T, Raitio T, Scherer S. COVAREP — a collaborative voice analysis repository for speech technologies. In: Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2014, 960−964
28 Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning. 2006, 369−376
Supplementary material: FCS-22444-OF-ZF_suppl_1