Label distribution for multimodal machine learning

Front. Comput. Sci., 2022, Vol. 16, Issue 1: 161306

https://doi.org/10.1007/s11704-021-0611-6

RESEARCH ARTICLE

Yi REN, Ning XU, Miaogen LING, Xin GENG

Department of Computer Science and Engineering, Southeast University, Nanjing 211189, China

Abstract

Multimodal machine learning (MML) aims to understand the world from multiple related modalities. It has attracted much attention as multimodal data have become increasingly available in real-world applications. MML has been shown to outperform single-modal machine learning, since multiple modalities contain more information and can complement each other. Fusing the modalities, however, remains a key challenge in MML. Unlike previous work, we also consider side-information, which reflects the situation and influences the fusion of the modalities. By leveraging the side-information, we recover the multimodal label distribution (MLD), which represents the degree to which each modality contributes to describing the instance. Accordingly, we propose a novel framework named multimodal label distribution learning (MLDL), which recovers the MLD and fuses the modalities under its guidance to learn an in-depth joint feature representation. Moreover, two versions of MLDL are proposed to deal with sequential data. Experiments on multimodal sentiment analysis and disease prediction show that the proposed approaches perform favorably against state-of-the-art methods.
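
As a rough illustration of the idea of weighting modalities by a distribution, the sketch below maps a side-information vector to one weight per modality and uses those weights to fuse modality features. It is not the MLDL framework itself: the abstract does not specify the architecture, so the linear scorer, the softmax normalization, the weighted-sum fusion, and all names (`modality_distribution`, `fuse`, `scorer_weights`) and toy numbers are assumptions made only for illustration.

```python
import math


def softmax(scores):
    """Map real-valued scores to a distribution (non-negative, sums to 1)."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_distribution(side_info, scorer_weights):
    """Hypothetical linear scorer: one score per modality from the
    side-information vector, normalized into a distribution.
    (Assumption: the paper's actual recovery procedure is not given here.)"""
    scores = [sum(w * x for w, x in zip(row, side_info)) for row in scorer_weights]
    return softmax(scores)


def fuse(features, distribution):
    """Weighted sum of equal-length modality feature vectors.
    (Assumption: a plain convex combination stands in for the fusion step.)"""
    dim = len(features[0])
    return [sum(d * f[i] for d, f in zip(distribution, features)) for i in range(dim)]


# Toy data, invented for illustration: three modalities (text, audio, vision).
side_info = [1.0, 0.0]
scorer_weights = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

mld = modality_distribution(side_info, scorer_weights)
fused = fuse(features, mld)
```

With these toy values the first modality receives the largest weight, so the fused vector is dominated by its features. In a learned model the scorer weights would be trained rather than fixed by hand.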

Keywords: multimodal machine learning, label distribution learning, sentiment analysis, disease prediction

Corresponding Author(s): Xin GENG

Just Accepted Date: 12 July 2021 Issue Date: 28 September 2021

Cite this article:

Yi REN, Ning XU, Miaogen LING, et al. Label distribution for multimodal machine learning[J]. Front. Comput. Sci., 2022, 16(1): 161306.

URL:

https://academic.hep.com.cn/fcs/EN/10.1007/s11704-021-0611-6
https://academic.hep.com.cn/fcs/EN/Y2022/V16/I1/161306

