1. School of Computer Science and Engineering, Northeastern University, Shenyang 110000, China
2. Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200082, China
3. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100089, China
Music is the language of emotions. In recent years, music emotion recognition has attracted widespread attention in the academic and industrial communities, since it can be applied in fields such as recommendation systems, automatic music composition, psychotherapy, and music visualization. With the rapid development of artificial intelligence in particular, deep learning-based music emotion recognition is gradually becoming mainstream. This paper gives a detailed survey of music emotion recognition. Starting with preliminary knowledge of music emotion recognition, it first introduces commonly used evaluation metrics. It then puts forward a three-part research framework. Based on this framework, the knowledge and algorithms involved in each part are introduced with detailed analysis, including commonly used datasets, emotion models, feature extraction methods, and emotion recognition algorithms. After that, the challenging problems and development trends of music emotion recognition technology are discussed, and finally the whole paper is summarized.
Fig.1 Categorization of music emotion recognition tasks:

- Static MER
  - Categorical approach: predict the categorical emotion labels of music pieces
  - Dimensional approach: predict the numerical emotion values of music pieces
- MEVD (music emotion variation detection)
  - Categorical approach: predict the dynamic categorical emotion variation within a music piece
  - Dimensional approach: predict the dynamic dimensional emotion variation within a music piece

Tab.1

| Model name | Application domain | Emotion conceptualization | Number of classes/dimensions | Emotional definition |
| --- | --- | --- | --- | --- |
| Hevner affective ring [7] | Music | Categorical | 67 | Perceived |
| Russell's circumplex model of affect [8,9] | General | Dimensional | 2 | Perceived |
| GEMS [10] | Music | Categorical | 45 | Induced |
| Thayer [11] | General | Dimensional | 2 | Perceived |
Tab.2

| Dataset name | Emotion conceptualization | Number of songs | Data type | Genres | Research directions |
| --- | --- | --- | --- | --- | --- |
| MediaEval emotion in music [19] | Dimensional | 1000 | MP3 | Rock, pop, soul, blues, etc. | Dynamic |
| CAL500 [20] | Categorical | 500 | MP3 | ? | Static |
| CAL500exp [21] | Categorical | 3223 (segments) | MP3 | ? | Dynamic |
| AMG1608 [22] | Dimensional | 1608 | WAV | Rock, metal, country, jazz, etc. | Static |
| DEAM [23] | Dimensional | 1802 | MP3 | Rock, pop, electronic, etc. | Dynamic |
| MTurk [24] | Dimensional | 240 | ? | ? | Dynamic |
| Soundtracks [25] | Categorical and dimensional | 360 | MP3 | Rap, R&B, electronic, etc. | Static |
| Emotify music database [26] | Categorical | 400 | MP3 | Rock, classical, pop and electronic | Static |
Tab.3

| Data format | Preprocessing method | Preprocessing tools | Preprocessing result |
| --- | --- | --- | --- |
| WAV, MP3 | Framing, windowing, MFCC extraction, spectrogram extraction, etc. | Psysound (software), MIRtoolbox (MATLAB), Librosa (Python package), etc. | MFCC, spectrogram, etc. |
| MIDI | Main track extraction, etc. | pretty_midi, music21 (Python packages) | Key, BPM, melody, etc. |
| Text | Segmentation, cleaning, normalization, etc. | NLTK, Gensim, Jieba, Stanford NLP (Python packages), etc. | BOW, TF-IDF, word embeddings, etc. |
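The framing and windowing steps listed for WAV/MP3 data can be sketched in plain NumPy. The following illustrative example turns a waveform into a magnitude spectrogram; the frame length, hop size, and synthetic test tone are arbitrary choices for demonstration, and in practice a tool such as Librosa wraps these steps:

```python
import numpy as np

def spectrogram(signal, frame_len=1024, hop=512):
    """Frame the signal, apply a Hann window, and take the rFFT magnitude of each frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (n_frames, frame_len // 2 + 1)

# demo on a synthetic one-second 440 Hz tone sampled at 22050 Hz
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
S = spectrogram(tone)
print(S.shape)  # (42, 513)
```

The MFCC step listed in the table then follows from such a spectrogram by mapping it onto a mel filter bank and taking a discrete cosine transform of the log energies.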
Tab.4

| First-level audio feature | Second-level audio features |
| --- | --- |
| Rhythmic features | Duration, pitch, energy, etc. [33] |
| Timbre features | MFCC, zero crossing rate, chroma, etc. [32] |
| Spectral features | Spectral flatness measure, spectral centroid, etc. [32] |
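Two of the second-level features above have simple closed forms: zero crossing rate is the fraction of sign changes within a frame, and spectral centroid is the magnitude-weighted mean frequency of the frame's spectrum. A hedged pure-NumPy sketch on synthetic tones (frame lengths and frequencies are illustrative):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (a timbre feature)."""
    return np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:]))

def spectral_centroid(frame, sr):
    """Magnitude-weighted mean frequency of the spectrum (a spectral feature)."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return np.sum(freqs * mag) / np.sum(mag)

sr = 22050
t = np.arange(2048) / sr
low = np.sin(2 * np.pi * 220 * t)    # low synthetic tone
high = np.sin(2 * np.pi * 3520 * t)  # high synthetic tone
# both features grow with the dominant frequency of the signal
print(f"ZCR low/high: {zero_crossing_rate(low):.3f} / {zero_crossing_rate(high):.3f}")
print(f"centroid low/high: {spectral_centroid(low, sr):.0f} / {spectral_centroid(high, sr):.0f} Hz")
```

Both values rise with the dominant frequency, which is one reason such features correlate with perceived brightness and, in turn, with arousal.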
Tab.5

| Reference | Feature modalities | Machine learning model | Emotion model | Dataset |
| --- | --- | --- | --- | --- |
| [39] | Lyric | SVM | 18 classes | Self-built |
| [47] | Physiological signals | SVM, NB, KNN, DT | 3 classes | Peripheral physiological signals data |
| [49] | Audio | SVM | 13/6 classes | Self-built |
| [50] | Audio and lyric | SVM, RF | 4 classes | Self-built |
| [51] | Audio and lyric | SVM | 4 classes | Self-built |
| [52] | Audio | CLR | 6/18 classes | EMOTIONS, CAL500 |
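Most of the static categorical studies in the table above feed hand-crafted audio or lyric features to an SVM classifier. A minimal scikit-learn sketch of that pipeline is shown below; the random features and four-class labels are synthetic stand-ins for real extracted features and annotations, not data from any cited study:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_songs, n_feats, n_classes = 200, 20, 4   # e.g. MFCC statistics per song (illustrative)
X = rng.normal(size=(n_songs, n_feats))
y = rng.integers(0, n_classes, size=n_songs)  # one emotion class per song
X += y[:, None] * 0.8                          # shift class means so the toy task is learnable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```

Feature standardization before the SVM matters in practice because audio features (energy, MFCCs, centroid) live on very different scales.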
Tab.6

| Reference | Feature modalities | Machine learning model | Emotion model | Dataset |
| --- | --- | --- | --- | --- |
| [13] | Audio | SVR, MLR | VA model | Self-built |
| [42] | Lyric | SVM | VA model | Self-built |
| [53] | Audio | AEG | VA model | MTurk, MER60 |
| [54] | Audio | AEG | VA model | AMG1608 |
| [55] | Audio | LR | VA model | AMG240 |
| [56] | Audio | GPR | VA model | MediaEval emotion in music |
| [57] | Audio | SVR, MLR, GPR | VA model | Self-built |
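The dimensional studies in the table above map features to continuous valence and arousal (VA) values, commonly by training one regressor per dimension. A minimal scikit-learn sketch under that setup; the features and the linear VA annotations are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
n_songs, n_feats = 150, 12
X = rng.normal(size=(n_songs, n_feats))  # stand-in audio features
# synthetic ground truth: valence and arousal as noisy linear functions of the features
w_v, w_a = rng.normal(size=n_feats), rng.normal(size=n_feats)
valence = 0.3 * (X @ w_v) + 0.05 * rng.normal(size=n_songs)
arousal = 0.3 * (X @ w_a) + 0.05 * rng.normal(size=n_songs)

# one regressor per dimension: train on the first 100 songs, evaluate on the rest
val_model = SVR(kernel="linear").fit(X[:100], valence[:100])
aro_model = SVR(kernel="linear").fit(X[:100], arousal[:100])
print(f"valence R2: {val_model.score(X[100:], valence[100:]):.2f}")
print(f"arousal R2: {aro_model.score(X[100:], arousal[100:]):.2f}")
```

Treating the two dimensions independently is a simplification; several of the cited works instead model the joint VA distribution (e.g. the AEG model).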
Tab.7

| Reference | Feature modalities | Machine learning model | Emotion model | Dataset |
| --- | --- | --- | --- | --- |
| [13] | Audio | SVR | VA model | Self-built |
| [57] | Audio | SVR, MLR, GPR | VA model | Self-built |
| [58] | Audio | GMM | 4 classes | Self-built |
| [59] | Audio | SVM, SVR | VA model | Self-built |
| [60] | Audio | DS-SVR | VA model | MediaEval emotion in music |
Tab.8

| Reference | Year | Feature modalities | Learning model | Emotion model | Dataset |
| --- | --- | --- | --- | --- | --- |
| [61] | 2016 | Audio and lyric | DBM | 4 classes | MSD |
| [32] | 2017 | Audio | CNN | 18 classes | CAL500, CAL500exp |
| [62] | 2019 | EEG | CNN | 2 classes | EEG data collected from subjects |
| [63] | 2020 | Audio | VGGNet | 4 classes | Soundtracks, Bi-Modal |
| [64] | 2020 | Audio | CNN | 2 classes | EmoMusic |
Tab.9

| Reference | Year | Feature modalities | Learning model | Emotion model | Dataset |
| --- | --- | --- | --- | --- | --- |
| [65] | 2017 | Audio | LSTM, attention mechanism | VA model | Emotion in Music task at MediaEval 2015 |
| [18] | 2018 | Audio | BiLSTM | Based on VA model | DEAM |
| [66] | 2018 | Audio | GARN | GEMS | Emotify music database |
| [67] | 2018 | Audio and lyric | CNN, RNN, etc. | VA model | MSD |
| [68] | 2019 | Audio | BCRSN | Based on VA model | DEAM, MTurk |
| [69] | 2019 | Audio | VGG-based | 8 dimensions | Mid-level Perceptual Features dataset, Soundtracks |
Tab.10

| Reference | Year | Feature modalities | Learning model | Emotion model | Dataset |
| --- | --- | --- | --- | --- | --- |
| [12] | 2014 | Audio | LSTM, SVR | VA model | MediaEval Emotion in Music |
| [57] | 2014 | Audio | SVR, MLR, GPR | VA model | Self-built |
| [70] | 2016 | Audio | DBLSTM | VA model | MediaEval Emotion in Music |
| [71] | 2020 | Audio | Attentive LSTM | VA model | MediaEval Emotion in Music |
Tab.11

| Reference | Method | Dataset | Performance |
| --- | --- | --- | --- |
| [61] | Bi-modal deep Boltzmann machine | MSD | 78.5% (accuracy) |
| [67] | CNN, LSTM | MSD | 0.219 for valence, 0.232 for arousal (R²) |
| [65] | MCA | MediaEval dataset | 0.291 for valence, 0.241 for arousal (RMSE) |
| [70] | DBLSTM | MediaEval dataset | 0.285 for valence, 0.225 for arousal (RMSE) |
| [32] | CNN | CAL500 | 42.6% (macro average precision) |
| [52] | CLR | CAL500 | 48.8% (macro average precision) |
Tab.12

| Year | Method | Accuracy/% |
| --- | --- | --- |
| 2020 | Mel spectrogram + CNN | 69.5 |
| 2019 | - | 68 |
| 2018 | STFT + CNN | 61.17 |
| 2017 | Mel spectrogram + DCNN + SVM | 69.83 |
| 2016 | FFT, MFCC + CNN | 63.33 |
| 2015 | - | 66.17 |
| 2014 | MFCC + SVM | 66.33 |
| 2013 | Visual and acoustic features + SVM | 68.33 |
| 2012 | Audio features + SVM based models | 67.83 |
| 2011 | Audio features + SRC | 69.5 |
References

1. X Y Yang, Y Z Dong, J Li. Review of data features-based music emotion recognition methods. Multimedia Systems, 2018, 24(4): 365–389
2. Z Y Cheng, J L Shen, L Zhu, M Kankanhalli, L Q Nie. Exploiting music play sequence for music recommendation. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2017, 3654−3660
3. Z Y Cheng, J L Shen, L Q Nie, T S Chua, M Kankanhalli. Exploring user-specific information in music retrieval. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2017, 655–664
4. Y E Kim, E M Schmidt, R Migneco, B G Morton, P Richardson, J Scott, J A Speck, D Turnbull. Music emotion recognition: a state of the art review. In: Proceedings of the 11th International Society for Music Information Retrieval Conference. 2010, 255–266
5. Y H Yang, H H Chen. Machine recognition of music emotion: a review. ACM Transactions on Intelligent Systems and Technology, 2011, 3(3): 1−30
6. M Bartoszewski, H Kwasnicka, M U Kaczmar, P B Myszkowski. Extraction of emotional content from music data. In: Proceedings of the 7th International Conference on Computer Information Systems and Industrial Management Applications. 2008, 293–299
7. K Hevner. Experimental studies of the elements of expression in music. The American Journal of Psychology, 1936, 48(2): 246–268
8. J A Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 1980, 39(6): 1161–1178
9. J Posner, J A Russell, B S Peterson. The circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology. Development and Psychopathology, 2005, 17(3): 715–734
10. M Chełkowska-Zacharewicz, M Janowski. Polish adaptation of the Geneva Emotional Music Scale (GEMS): factor structure and reliability. Psychology of Music, 2020, 57(6): 427−438
11. R Thayer. The Biopsychology of Mood and Arousal. 1st ed. Oxford: Oxford University Press, 1989
12. F Weninger, F Eyben, B W Schuller. On-line continuous-time music mood regression with deep recurrent neural networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2014, 5412−5416
13. Y H Yang, Y C Lin, Y F Su, H H Chen. A regression approach to music emotion recognition. IEEE Transactions on Audio, Speech, and Language Processing, 2008, 16(2): 448–457
14. X X Li, H S Xianyu, J S Tian, W X Chen, F H Meng, M X Xu, L H Cai. A deep bidirectional long short-term memory based multi-scale approach for music dynamic emotion prediction. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2016, 544–548
15. J Y Fan, K Tatar, M Thorogood, P Pasquier. Ranking-based emotion recognition for experimental music. In: Proceedings of the 18th International Society for Music Information Retrieval Conference. 2017, 368–375
16. N Thammasan, K I Fukui, M Numao. Multimodal fusion of EEG and musical features in music-emotion recognition. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017, 4991−4992
17. Y H Yang, H H Chen. Prediction of the distribution of perceived music emotions using discrete samples. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(7): 2184–2196
18. H P Liu, Y Fang, Q H Huang. Music emotion recognition using a variant of recurrent neural network. In: Proceedings of the International Conference on Mathematics, Modeling, Simulation and Statistics Application. 2018, 15−18
19. M Soleymani, M N Caro, E M Schmidt, C Y Sha, Y H Yang. 1000 songs for emotional analysis of music. In: Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia. 2013, 1–6
20. D Turnbull, L Barrington, D Torres, G Lanckriet. Towards musical query-by-semantic-description using the CAL500 data set. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2007, 439–446
21. S Y Wang, J C Wang, Y H Yang, H M Wang. Towards time-varying music auto-tagging on CAL500 expansion. In: Proceedings of the IEEE International Conference on Multimedia and Expo. 2014, 1–6
22. Y A Chen, Y H Yang, J C Wang, H Chen. The AMG1608 dataset for music emotion recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2015, 693–697
23. A Aljanaki, Y H Yang, M Soleymani. Developing a benchmark for emotional analysis of music. PLoS ONE, 2017, 12(3): e0173392
24. J A Speck, E M Schmidt, B G Morton, Y E Kim. A comparative study of collaborative vs. traditional musical mood annotation. In: Proceedings of the 12th International Society for Music Information Retrieval Conference. 2011, 549–554
25. T Eerola, J K Vuoskoski. A comparison of the discrete and dimensional models of emotion in music. Psychology of Music, 2011, 39(1): 18–49
26. M Zentner, D Grandjean, K R Scherer. Emotions evoked by the sound of music: characterization, classification, and measurement. Emotion, 2008, 8(4): 494–521
27. T B Mahieux, D P W Ellis, B Whitman, P Lamere. The million song dataset. In: Proceedings of the 12th International Society for Music Information Retrieval Conference. 2011, 591–596
28. G Tzanetakis, P Cook. MARSYAS: a framework for audio analysis. Organised Sound, 2000, 4(3): 169–175
29. B Mathieu, S Essid, T Fillon, J Prado, G Richard. YAAFE, an easy to use and efficient audio feature extraction software. In: Proceedings of the 11th International Society for Music Information Retrieval Conference. 2010, 441–446
30. O Lartillot, P Toiviainen. MIR in MATLAB (II): a toolbox for musical feature extraction from audio. In: Proceedings of the 8th International Conference on Music Information Retrieval. 2007, 127–130
31. D McEnnis, C McKay, I Fujinaga, P Depalle. jAudio: a feature extraction library. In: Proceedings of the 6th International Conference on Music Information Retrieval. 2005, 600–603
32. X Liu, Q C Chen, X P Wu, Y Liu, Y Liu. CNN based music emotion classification. 2017, arXiv preprint arXiv: 1704.05665
33. W J Han, H F Li, H B Ruan, L Ma. Review on speech emotion recognition (in Chinese). Journal of Software, 2014, 25(1): 37–50
34. M Barthet, G Fazekas, M Sandler. Multidisciplinary perspectives on music emotion recognition: implications for content and context-based models. In: Proceedings of the 9th International Symposium on Computer Music Modelling and Retrieval. 2012, 492–507
35. P L Chen, L Zhao, Z Y Xin, Y M Qiang, M Zhang, T M Li. A scheme of MIDI music emotion classification based on fuzzy theme extraction and neural network. In: Proceedings of the 12th International Conference on Computational Intelligence and Security. 2016, 323–326
36. P N Juslin, P Laukka. Expression, perception, and induction of musical emotions: a review and a questionnaire study of everyday listening. Journal of New Music Research, 2004, 33(3): 217–238
37. D Yang, W S Lee. Disambiguating music emotion using software agents. In: Proceedings of the 5th International Conference on Music Information Retrieval. 2004, 218–223
38. H He, J M Jin, Y H Xiong, B Chen, L Zhao. Language feature mining for music emotion classification via supervised learning from lyrics. In: Proceedings of the International Symposium on Intelligence Computation and Applications. 2008, 426–435
39. X Hu, J S Downie, A F Ehmann. Lyric text mining in music mood classification. In: Proceedings of the 10th International Society for Music Information Retrieval Conference. 2009, 411–416
40. M V Zaanen, P Kanters. Automatic mood classification using TF*IDF based on lyrics. In: Proceedings of the 11th International Society for Music Information Retrieval Conference. 2010, 75–80
41. X Wang, X O Chen, D S Yang, Y Q Wu. Music emotion classification of Chinese songs based on lyrics using TF*IDF and rhyme. In: Proceedings of the 12th International Society for Music Information Retrieval Conference. 2011, 765–770
42. R Malheiro, R Panda, P Gomes, R P Paiva. Emotionally-relevant features for classification and regression of music lyrics. IEEE Transactions on Affective Computing, 2018, 9(2): 240–254
43. Y J Hu, X O Chen, D S Yang. Lyric-based song emotion detection with affective lexicon and fuzzy clustering method. In: Proceedings of the 10th International Society for Music Information Retrieval Conference. 2009, 123–128
44. D Yang, W S Lee. Music emotion identification from lyrics. In: Proceedings of the 11th IEEE International Symposium on Multimedia. 2009, 624–629
45. K Dakshina, R Sridhar. LDA based emotion recognition from lyrics. Advanced Computing, Networking and Informatics, 2014, 27(1): 187–194
46. N Thammasan, K I Fukui, M Numao. Application of deep belief networks in EEG-based dynamic music-emotion recognition. In: Proceedings of the 2016 International Joint Conference on Neural Networks. 2016, 881–888
47. X Hu, F J Li, D T J Ng. On the relationships between music-induced emotion and physiological signals. In: Proceedings of the 19th International Society for Music Information Retrieval Conference. 2018, 362–369
48. N E Nawa, D E Callan, P Mokhtari, H Ando, J Iversen. Decoding music-induced experienced emotions using functional magnetic resonance imaging: preliminary results. In: Proceedings of the 2018 International Joint Conference on Neural Networks. 2018, 1–7
49. T Li, M Ogihara. Detecting emotion in music. In: Proceedings of the 4th International Conference on Music Information Retrieval. 2003, 239–240
50. C Laurier, J Grivolla, P Herrera. Multimodal music mood classification using audio and lyrics. In: Proceedings of the 7th International Conference on Machine Learning and Applications. 2008, 688–693
51. Y H Yang, Y C Lin, H T Cheng, I B Liao, Y C Ho, H Chen. Toward multi-modal music emotion classification. In: Proceedings of the 9th Pacific Rim Conference on Multimedia. 2008, 70–79
52. Y Liu, Y Liu, Y Zhao, K A Hua. What strikes the strings of your heart? Feature mining for music emotion analysis. IEEE Transactions on Affective Computing, 2015, 6(3): 247–260
53. J C Wang, Y H Yang, H M Wang, S K Jeng. The acoustic emotion Gaussians model for emotion-based music annotation and retrieval. In: Proceedings of the 20th ACM Multimedia Conference. 2012, 89–98
54. Y A Chen, J C Wang, Y H Yang, H Chen. Component tying for mixture model adaptation in personalization of music emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017, 25(7): 1409–1420
55. Y A Chen, J C Wang, Y H Yang, H Chen. Linear regression-based adaptation of music emotion recognition models for personalization. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2014, 2149−2153
56. S Fukayama, M Goto. Music emotion recognition with adaptive aggregation of Gaussian process regressors. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2016, 71–75
57. M Soleymani, A Aljanaki, Y H Yang, M N Caro, F Eyben, K Markov, B Schuller, R C Veltkamp, F Weninger, F Wiering. Emotional analysis of music: a comparison of methods. In: Proceedings of the ACM International Conference on Multimedia. 2014, 1161−1164
58. L Lu, D Liu, H J Zhang. Automatic mood detection and tracking of music audio signals. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(1): 5–18
59. E M Schmidt, D Turnbull, Y E Kim. Feature selection for content-based, time-varying musical emotion regression. In: Proceedings of the 11th ACM SIGMM International Conference on Multimedia Information Retrieval. 2010, 267–274
60. H S Xianyu, X X Li, W S Chen, F H Meng, J S Tian, M X Xu, L H Cai. SVR based double-scale regression for dynamic emotion prediction in music. In: Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. 2016, 549–553
61. M Y Huang, W G Rong, T Arjannikov, J Nan, Z Xiong. Bi-modal deep Boltzmann machine based musical emotion classification. In: Proceedings of the 25th International Conference on Artificial Neural Networks. 2016, 199–207
62. P Keelawat, N Thammasan, B Kijsirikul, M Numao. Subject-independent emotion recognition during music listening based on EEG using deep convolutional neural networks. In: Proceedings of the 15th IEEE International Colloquium on Signal Processing & Its Applications. 2019, 21–26
63. R Sarkar, S Choudhury, S Dutta, A Roy, S K Saha. Recognition of emotion in music based on deep convolutional neural network. Multimedia Tools and Applications, 2020, 79(9): 765–783
64. P T Yang, S M Kuang, C C Wu, J L Hsu. Predicting music emotion by using convolutional neural network. In: Proceedings of the 22nd HCI International Conference. 2020, 266–275
65. Y Ma, X X Li, M X Xu, J Jia, L H Cai. Multi-scale context based attention for dynamic music emotion prediction. In: Proceedings of the 25th ACM International Conference on Multimedia. 2017, 1443−1450
66. W H Chang, J L Li, Y S Lin, C C Lee. A genre-affect relationship network with task-specific uncertainty weighting for recognizing induced emotion in music. In: Proceedings of the 2018 IEEE International Conference on Multimedia and Expo. 2018, 1–8
67. R Delbouys, R Hennequin, F Piccoli, J R Letelier, M Moussallam. Music mood detection based on audio and lyrics with deep neural net. In: Proceedings of the 19th International Society for Music Information Retrieval Conference. 2018, 370–375
68. Y Z Dong, X Y Yang, X Zhao, J Li. Bidirectional convolutional recurrent sparse network (BCRSN): an efficient model for music emotion recognition. IEEE Transactions on Multimedia, 2019, 21(12): 3150–3163
69. S Chowdhury, A Vall, V Haunschmid, G Widmer. Towards explainable music emotion recognition: the route via mid-level features. In: Proceedings of the 20th International Society for Music Information Retrieval Conference. 2019, 237–243
70. X X Li, J S Tian, M X Xu, Y S Ning, L H Cai. DBLSTM-based multi-scale fusion for dynamic emotion prediction in music. In: Proceedings of the IEEE International Conference on Multimedia and Expo. 2016, 1–6
71. S Chaki, P Doshi, P Patnaik, S Bhattacharya. Attentive RNNs for continuous-time emotion prediction in music clips. In: Proceedings of the 3rd Workshop on Affective Content Analysis co-located with the 34th AAAI Conference on Artificial Intelligence. 2020, 36–45
72. R Panda, R Malheiro, R P Paiva. Novel audio features for music emotion recognition. IEEE Transactions on Affective Computing, 2020, 11(4): 614–626
73. S G Deng, D J Wang, X T Li, G D Xu. Exploring user emotion in microblogs for music recommendation. Expert Systems with Applications, 2015, 42(1): 9284–9293
74. L N Ferreira, J Whitehead. Learning to generate music with sentiment. In: Proceedings of the 20th International Society for Music Information Retrieval Conference. 2019, 384–390