1. School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
2. Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Beijing 100876, China
3. Shanghai International School of Chief Technology Officer, East China Normal University, Shanghai 200062, China
4. School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
5. School of Cyberspace Security, Guangzhou University, Guangdong 510006, China
Learning modality-fused representations and processing unaligned multimodal sequences are both meaningful and challenging in multimodal emotion recognition. Existing approaches fuse the language, visual, and audio modalities through directional pairwise attention or a message hub, but such fusion is typically quadratic in the modal sequence length, introduces redundant information, and is therefore inefficient. In this paper, we propose an efficient neural network that learns modality-fused representations with a CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first extract features from each of the three modalities to capture the local structure of the sequences. We then design a novel asymmetric transformer with cross-modal blocks (CB-Transformer) that enables complementary learning across modalities and consists mainly of local temporal learning, cross-modal feature fusion, and global self-attention representation. In addition, we concatenate the fused features with the original features to classify the emotions of the sequences. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results demonstrate the superiority and efficiency of our proposed method in both settings; compared with mainstream methods, our approach achieves state-of-the-art performance with the fewest parameters.
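To make the described pipeline concrete, the following is a minimal PyTorch sketch of the three-stage design summarized above: per-modality feature extraction, cross-modal fusion, global self-attention, and concatenation of the fused features with the original features before classification. All module choices, feature dimensions, and the exact wiring of the cross-modal blocks are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of the pipeline described in the abstract, written in PyTorch.
# Module names, dimensions, and the fusion wiring are illustrative assumptions.
import torch
import torch.nn as nn


class CBTransformerSketch(nn.Module):
    """Hypothetical three-modality fusion model: per-modality temporal
    convolutions, cross-modal attention blocks, global self-attention,
    then concatenation of fused and original (pooled) features."""

    def __init__(self, d_l=300, d_a=74, d_v=35, d_model=40, n_heads=4, n_classes=4):
        super().__init__()
        # Local feature extraction: 1D convolutions capture the local structure
        # of each (possibly unaligned) sequence and project to a shared width.
        self.proj_l = nn.Conv1d(d_l, d_model, kernel_size=3, padding=1)
        self.proj_a = nn.Conv1d(d_a, d_model, kernel_size=3, padding=1)
        self.proj_v = nn.Conv1d(d_v, d_model, kernel_size=3, padding=1)
        # Cross-modal fusion: audio and visual streams attend to the language
        # stream (an assumed stand-in for the paper's asymmetric cross-modal blocks).
        self.cross_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Global self-attention over the fused sequence.
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.global_attn = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Classifier over fused + original pooled features (4 * d_model in total).
        self.classifier = nn.Linear(4 * d_model, n_classes)

    def forward(self, x_l, x_a, x_v):
        # Inputs: (batch, seq_len_m, feat_dim_m) per modality; lengths may differ.
        h_l = self.proj_l(x_l.transpose(1, 2)).transpose(1, 2)
        h_a = self.proj_a(x_a.transpose(1, 2)).transpose(1, 2)
        h_v = self.proj_v(x_v.transpose(1, 2)).transpose(1, 2)
        # Cross-modal feature fusion: audio/visual queries, language keys/values.
        f_a, _ = self.cross_a(h_a, h_l, h_l)
        f_v, _ = self.cross_v(h_v, h_l, h_l)
        # Global self-attention representation of the fused sequence.
        fused = self.global_attn(torch.cat([f_a, f_v], dim=1))
        # Concatenate pooled fused features with pooled original features.
        pooled = torch.cat(
            [fused.mean(1), h_l.mean(1), h_a.mean(1), h_v.mean(1)], dim=-1)
        return self.classifier(pooled)


# Example with unaligned sequence lengths per modality (dimensions are hypothetical).
model = CBTransformerSketch()
logits = model(torch.randn(2, 50, 300), torch.randn(2, 375, 74), torch.randn(2, 500, 35))
print(logits.shape)  # torch.Size([2, 4])
```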
1. Nguyen D, Nguyen K, Sridharan S, Dean D, Fookes C. Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition. Computer Vision and Image Understanding, 2018, 174: 33–42
2. Poria S, Hazarika D, Majumder N, Mihalcea R. Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research. IEEE Transactions on Affective Computing, 2023, 14(1): 108–132
3. Dai W, Cahyawijaya S, Liu Z, Fung P. Multimodal end-to-end sparse model for emotion recognition. In: Proceedings of 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021
4. Chandra R, Krishna A. Covid-19 sentiment analysis via deep learning during the rise of novel cases. PLoS One, 2021, 16(8): e0255615
5. Tsai Y H H, Liang P P, Zadeh A, Morency L P, Salakhutdinov R. Learning factorized multimodal representations. In: Proceedings of the 7th International Conference on Learning Representations. 2019
6. Pham H, Liang P P, Manzini T, Morency L P, Póczos B. Found in translation: Learning robust joint representations by cyclic translations between modalities. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence and 31st Innovative Applications of Artificial Intelligence Conference and 9th AAAI Symposium on Educational Advances in Artificial Intelligence. 2019
7. Sahay S, Okur E, Kumar S H, Nachman L. Low rank fusion based transformers for multimodal sequences. In: Proceedings of the 2nd Grand-Challenge and Workshop on Multimodal Language (Challenge-HML). 2020
8. Rahman W, Hasan M K, Lee S, Zadeh A B, Mao C, Morency L P, Hoque E. Integrating multimodal information in large pretrained transformers. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020
9. Hazarika D, Zimmermann R, Poria S. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In: Proceedings of the 28th ACM International Conference on Multimedia. 2020
10. Yu W, Xu H, Yuan Z, Wu J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2021
11. Dai W, Cahyawijaya S, Bang Y, Fung P. Weakly-supervised multi-task learning for multimodal affect recognition. 2021, arXiv preprint arXiv: 2104.11560
12. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017, 6000−6010
13. Tsai Y H H, Bai S, Liang P P, Kolter J Z, Morency L P, Salakhutdinov R. Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019
14. Lv F, Chen X, Huang Y, Duan L, Lin G. Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences. In: Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021, 2554−2562
15. Busso C, Bulut M, Lee C C, Kazemzadeh A, Mower E, Kim S, Chang J, Lee S, Narayanan S S. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 2008, 42(4): 335–359
16. Zadeh A, Zellers R, Pincus E, Morency L P. Multimodal sentiment intensity analysis in videos: Facial gestures and verbal messages. IEEE Intelligent Systems, 2016, 31(6): 82–88
17. Zadeh A B, Liang P P, Poria S, Cambria E, Morency L P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, 2236−2246
18. Morency L P, Mihalcea R, Doshi P. Towards multimodal sentiment analysis: Harvesting opinions from the web. In: Proceedings of the 13th International Conference on Multimodal Interfaces. 2011, 169−176
19. Pérez-Rosas V, Mihalcea R, Morency L P. Utterance-level multimodal sentiment analysis. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013
20. Zadeh A, Zellers R, Pincus E, Morency L P. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. 2016, arXiv preprint arXiv: 1606.06259
21. Wang H, Meghawat A, Morency L P, Xing E P. Select-additive learning: Improving generalization in multimodal sentiment analysis. In: Proceedings of 2017 IEEE International Conference on Multimedia and Expo (ICME). 2017, 949−954
22. Wang Y, Shen Y, Liu Z, Liang P P, Zadeh A, Morency L P. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 7216–7223
23. Zeng Z, Tu J, Pianfetti B, Liu M, Zhang T, Zhang Z, Huang T S, Levinson S. Audio-visual affect recognition through multi-stream fused HMM for HCI. In: Proceedings of 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). 2005, 967−972
24. Dai W, Liu Z, Yu T, Fung P. Modality-transferable emotion embeddings for low-resource multimodal emotion recognition. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing. 2020
25. Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation. In: Proceedings of 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014, 1532−1543
26. Baltrusaitis T, Robinson P, Morency L P. OpenFace: An open source facial behavior analysis toolkit. In: Proceedings of 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). 2016, 1−10
27. Degottex G, Kane J, Drugman T, Raitio T, Scherer S. COVAREP — a collaborative voice analysis repository for speech technologies. In: Proceedings of 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2014, 960−964
28. Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning. 2006, 369−376