Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal distribution code: 80-970

2019 Impact Factor: 1.275

Frontiers of Computer Science  2022, Vol. 16 Issue (3): 163703   https://doi.org/10.1007/s11704-020-0133-7
Speech-driven facial animation with spectral gathering and temporal attention
Yujin CHAI1, Yanlin WENG1, Lvdi WANG2, Kun ZHOU1
1. State Key Lab of CAD&CG, Zhejiang University, Hangzhou 310058, China
2. FaceUnity Technology Inc., Hangzhou 310011, China
Abstract

In this paper, we present an efficient algorithm that generates lip-synchronized facial animation from a given vocal audio clip. By combining spectral-dimensional bidirectional long short-term memory with a temporal attention mechanism, we design a lightweight speech encoder that learns useful and robust vocal features from the input audio without resorting to pre-trained speech recognition modules or large training datasets. To learn subject-independent facial motion, we use deformation gradients as the internal representation, which allows nuanced local motions to be synthesized better than with vertex offsets. Compared with state-of-the-art automatic-speech-recognition-based methods, our model is much smaller yet achieves similar robustness and quality most of the time, and noticeably better results in certain challenging cases.

Keywords: speech-driven facial animation; spectral-dimensional bidirectional long short-term memory; temporal attention; deformation gradients
Received: 2020-03-31      Published online: 2021-09-18
Corresponding Author(s): Yanlin WENG   
Cite this article:
Yujin CHAI, Yanlin WENG, Lvdi WANG, Kun ZHOU. Speech-driven facial animation with spectral gathering and temporal attention. Front. Comput. Sci., 2022, 16(3): 163703.
Link to this article:
https://academic.hep.com.cn/fcs/CN/10.1007/s11704-020-0133-7
https://academic.hep.com.cn/fcs/CN/Y2022/V16/I3/163703
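As a rough illustration of the input that the speech encoder below consumes (Tab.1 lists a 3 × 128 × 64 Mel-spectrogram stack per 64-frame window), one plausible preprocessing is a log-Mel spectrogram plus its first- and second-order deltas. This is a hypothetical sketch, not the paper's pipeline: the three-channel interpretation, the sampling rate, and all librosa parameters are assumptions.

```python
import numpy as np
import librosa

def audio_to_mel_stack(wav_path, sr=16000, n_mels=128):
    """Hypothetical preprocessing: log-Mel spectrogram plus its
    first- and second-order deltas, giving a (3, 128, T) stack
    shaped like the Tab.1 input."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)            # (128, T)
    d1 = librosa.feature.delta(log_mel)           # first-order delta
    d2 = librosa.feature.delta(log_mel, order=2)  # second-order delta
    return np.stack([log_mel, d1, d2], axis=0)    # (3, 128, T)
```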
Fig.1  
| Layer | Kernel size 2) | Stride 2) | Activation | Output shape |
|---|---|---|---|---|
| Mel spectrogram | ? | ? | ? | 3 × 128 × 64 |
| Convolution2d | 3 × 1 | 1 × 1 | lrelu:0.2 1) | 32 × 128 × 64 |
| MaxPool2d | 2 × 1 | 2 × 1 | ? | 32 × 64 × 64 |
| Convolution2d | 3 × 1 | 1 × 1 | lrelu:0.2 | 64 × 64 × 64 |
| MaxPool2d | 2 × 1 | 2 × 1 | ? | 64 × 32 × 64 |
| Convolution2d | 1 × 1 | 1 × 1 | lrelu:0.2 | 64 × 32 × 64 |
| Spec-BiLSTM | ? | ? | ? | 64 × 32 × 64 |
| Frequency stack | ? | ? | ? | 2048 × 1 × 64 |
| Fully connected | ? | ? | ? | 256 × 1 × 64 |
| Squeezing | ? | ? | ? | 256 × 64 |
Tab.1  
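Read off Tab.1, the encoder convolves and pools only along the frequency axis (3 × 1 kernels, 2 × 1 pooling), runs a bidirectional LSTM across the 32 remaining frequency bins of each frame (the spectral gathering step), stacks the bins, and projects to a 256-d per-frame feature. The PyTorch sketch below reproduces those output shapes; the padding choices and the exact LSTM wiring are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class SpectralEncoder(nn.Module):
    """Per-frame speech encoder sketched from Tab.1
    (tensor layout assumed to be channels x frequency x time)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            # 3x1 kernels and 2x1 poolings act along the frequency axis only
            nn.Conv2d(3, 32, kernel_size=(3, 1), padding=(1, 0)),
            nn.LeakyReLU(0.2),
            nn.MaxPool2d(kernel_size=(2, 1)),              # 128 -> 64 bins
            nn.Conv2d(32, 64, kernel_size=(3, 1), padding=(1, 0)),
            nn.LeakyReLU(0.2),
            nn.MaxPool2d(kernel_size=(2, 1)),              # 64 -> 32 bins
            nn.Conv2d(64, 64, kernel_size=(1, 1)),
            nn.LeakyReLU(0.2),
        )
        # "Spec-BiLSTM": scans the 32 frequency bins of each frame;
        # hidden size 32 per direction keeps the 64-dim feature of Tab.1.
        self.spec_bilstm = nn.LSTM(64, 32, bidirectional=True,
                                   batch_first=True)
        self.fc = nn.Linear(64 * 32, 256)  # "frequency stack" -> 256

    def forward(self, mel):                # mel: (B, 3, 128, 64)
        x = self.conv(mel)                 # (B, 64, 32, 64)
        B, C, F, T = x.shape
        # Fold batch and time together so the LSTM walks over frequency.
        x = x.permute(0, 3, 2, 1).reshape(B * T, F, C)
        x, _ = self.spec_bilstm(x)         # (B*T, 32, 64)
        x = self.fc(x.reshape(B * T, F * C))   # stack bins: 2048 -> 256
        return x.reshape(B, T, 256)        # one 256-d feature per frame
```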
Fig.2  
| Layer | Output shape |
|---|---|
| Time-BiLSTM | 512 × 64 |
| Time-BiLSTM | 512 × 64 |
| Attention | 512 × 1 |
| Squeezing | 512 |
Tab.2  
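Tab.2 pins down the temporal stage only at the level of shapes: two bidirectional LSTMs over the 64 frames, then an attention layer that pools the 512-d per-frame states into one vector. A minimal sketch, assuming a simple learned scoring layer for the attention (the table does not specify its form):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Temporal stage sketched from Tab.2: stacked BiLSTMs plus
    attention pooling over the 64 audio frames."""
    def __init__(self, in_dim=256, hidden=256):
        super().__init__()
        # The two Time-BiLSTM rows of Tab.2, modeled as a 2-layer BiLSTM.
        self.bilstm = nn.LSTM(in_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.score = nn.Linear(2 * hidden, 1)  # assumed scoring layer

    def forward(self, feats):                  # feats: (B, 64, 256)
        h, _ = self.bilstm(feats)              # (B, 64, 512)
        w = F.softmax(self.score(h), dim=1)    # one weight per frame
        return (w * h).sum(dim=1)              # (B, 512) pooled feature
```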
| Layer | Activation | Output shape (scaling/shear) | Output shape (rotation) |
|---|---|---|---|
| Identity concat | ? | 520 (shared) | 520 (shared) |
| Fully connected | lrelu:0.2 | 512 (shared) | 512 (shared) |
| Identity concat | ? | 520 | 520 |
| Fully connected | lrelu:0.2 | 512 | 512 |
| Fully connected | tanh | 256 | 256 |
| Fully connected | ? | 85 | 180 |
| Inverse PCA | ? | 59865 | 29928 |
Tab.3  
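Tab.3 suggests a decoder with a shared trunk and two branches that predict PCA coefficients of the deformation gradients (85 for scaling/shear, 180 for rotation), followed by a fixed inverse-PCA mapping to the per-triangle outputs. The 520 − 512 = 8 width gap hints at an 8-d subject-identity code concatenated before each fully connected stage. In this sketch the identity-code size, the placeholder PCA bases, and the omitted PCA mean are all assumptions inferred from the table:

```python
import torch
import torch.nn as nn

class GradientDecoder(nn.Module):
    """Decoder sketched from Tab.3: shared trunk, two branches,
    fixed inverse-PCA linear maps."""
    def __init__(self, id_dim=8, n_scale=85, n_rot=180,
                 scale_dim=59865, rot_dim=29928):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(512 + id_dim, 512), nn.LeakyReLU(0.2))
        def branch(n_coeff):
            return nn.Sequential(
                nn.Linear(512 + id_dim, 512), nn.LeakyReLU(0.2),
                nn.Linear(512, 256), nn.Tanh(),
                nn.Linear(256, n_coeff))
        self.scale_head = branch(n_scale)  # scaling/shear coefficients
        self.rot_head = branch(n_rot)      # rotation coefficients
        # Placeholder PCA bases; in practice these would be precomputed
        # from training data (the PCA mean term is omitted in this sketch).
        self.register_buffer("scale_basis", torch.randn(n_scale, scale_dim))
        self.register_buffer("rot_basis", torch.randn(n_rot, rot_dim))

    def forward(self, feat, identity):   # feat: (B, 512); identity: (B, 8)
        x = self.shared(torch.cat([feat, identity], dim=1))  # 520 -> 512
        x = torch.cat([x, identity], dim=1)                  # 520 again
        scale = self.scale_head(x) @ self.scale_basis        # (B, 59865)
        rot = self.rot_head(x) @ self.rot_basis              # (B, 29928)
        return scale, rot
```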
Fig.3 – Fig.10
| Stage | Karras et al. [4] | VOCA [6] | Ours |
|---|---|---|---|
| Set static mesh | ? | ? | 13.15 |
| Preprocess audio | 1,672.15 | 8.94 | 553.92 |
| Get audio feature | 376.09 | 9,253.29 | 4,200.69 |
| Get animation feature | 51.06 | 578.35 | 545.03 |
| Reconstruct mesh | ? | ? | 589.57 |
| Total (with loading) | 2,531.80 | 11,999.58 | 6,425.50 |
Tab.4  
1 Cao C, Hou Q, Zhou K. Displaced dynamic expression regression for real-time facial tracking and animation. ACM Transactions on Graphics, 2014, 33(4): 1–10
2 Nagano K, Saito S, Goldwhite L, San K, Hong A, Hu L, Wei L, Xing J, Xu Q, Kung H W, Kuang J, Agarwal A, Castellanos E, Seo J, Fursund J, Li H. Pinscreen avatars in your pocket: mobile paGAN engine and personalized gaming. In: Proceedings of SIGGRAPH Asia 2018 Real-Time Live!. 2018, 1–1
3 Edwards P, Landreth C, Fiume E, Singh K. JALI: an animator-centric viseme model for expressive lip synchronization. ACM Transactions on Graphics, 2016, 35(4): 1–11
4 Karras T, Aila T, Laine S, Herva A, Lehtinen J. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Transactions on Graphics, 2017, 36(4): 1–12
5 Pham H X, Wang Y, Pavlovic V. End-to-end learning for 3D facial animation from speech. In: Proceedings of the 20th ACM International Conference on Multimodal Interaction. 2018, 361–365
6 Cudeiro D, Bolkart T, Laidlaw C, Ranjan A, Black M J. Capture, learning, and synthesis of 3D speaking styles. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 10101–10111
7 Hati Y, Rousseaux F, Duhart C. Text-driven mouth animation for human computer interaction with personal assistant. In: Proceedings of the 25th International Conference on Auditory Display. 2019, 75–82
8 Jurafsky D, Martin J H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd Edition). Upper Saddle River, New Jersey: Pearson Prentice Hall, 2009
9 Suwajanakorn S, Seitz S M, Kemelmacher-Shlizerman I. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics, 2017, 36(4): 1–13
10 Taylor S, Kim T, Yue Y, Mahler M, Krahe J, Rodriguez A G, Hodgins J, Matthews I. A deep learning approach for generalized speech animation. ACM Transactions on Graphics, 2017, 36(4): 1–11
11 Hussen Abdelaziz A, Theobald B J, Binder J, Fanelli G, Dixon P, Apostoloff N, Weise T, Kajareker S. Speaker-independent speech-driven visual speech synthesis using domain-adapted acoustic models. In: Proceedings of the 2019 International Conference on Multimodal Interaction. 2019, 220–225
12 Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8): 1735–1780
13 Hannun A, Case C, Casper J, Catanzaro B, Diamos G, Elsen E, Prenger R, Satheesh S, Sengupta S, Coates A, et al. Deep Speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014
14 Pham H X, Cheung S, Pavlovic V. Speech-driven 3D facial animation with implicit emotional awareness: a deep learning approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2017, 2328–2336
15 Tian G, Yuan Y, Liu Y. Audio2Face: generating speech/face animation from single audio with attention-based bidirectional LSTM networks. In: Proceedings of the 2019 IEEE International Conference on Multimedia and Expo Workshops. 2019, 366–371
16 Tzirakis P, Papaioannou A, Lattas A, Tarasiou M, Schuller B, Zafeiriou S. Synthesising 3D facial motion from "in-the-wild" speech. arXiv preprint arXiv:1904.07002, 2019
17 Nishimura R, Sakata N, Tominaga T, Hijikata Y, Harada K, Kiyokawa K. Speech-driven facial animation by LSTM-RNN for communication use. In: Proceedings of the 2019 IEEE Conference on Virtual Reality and 3D User Interfaces. 2019, 1102–1103
18 Sumner R W, Popović J. Deformation transfer for triangle meshes. ACM Transactions on Graphics, 2004, 23(3): 399–405
19 Wu Q, Zhang J, Lai Y K, Zheng J, Cai J. Alive caricature from 2D to 3D. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 7336–7345
20 Gao L, Lai Y, Yang J, Zhang L X, Kobbelt L, Xia S. Sparse data driven mesh deformation. IEEE Transactions on Visualization and Computer Graphics, 2019
21 Orvalho V, Bastos P, Parke F I, Oliveira B, Alvarez X. A facial rigging survey. In: Proceedings of Eurographics 2012 - State of the Art Reports. 2012, 183–204
22 Kent R D, Minifie F D. Coarticulation in recent speech production models. Journal of Phonetics, 1977, 5(2): 115–133
23 Pelachaud C, Badler N I, Steedman M. Generating facial expressions for speech. Cognitive Science, 1996, 20(1): 1–46
24 Wang A, Emmi M, Faloutsos P. Assembling an expressive facial animation system. In: Proceedings of the 2007 ACM SIGGRAPH Symposium on Video Games. 2007, 21–26
25 Cohen M M, Massaro D W. Modeling coarticulation in synthetic visual speech. In: Proceedings of Models and Techniques in Computer Animation. 1993, 139–156
26 Xu Y, Feng A W, Marsella S, Shapiro A. A practical and configurable lip sync method for games. In: Proceedings of Motion on Games. 2013, 131–140
27 Bregler C, Covell M, Slaney M. Video Rewrite: driving visual speech with audio. In: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques. 1997, 353–360
28 Ezzat T, Geiger G, Poggio T. Trainable videorealistic speech animation. ACM Transactions on Graphics, 2002, 21(3): 388–398
29 Taylor S L, Mahler M, Theobald B J, Matthews I. Dynamic units of visual speech. In: Proceedings of the 11th ACM SIGGRAPH/Eurographics Conference on Computer Animation. 2012, 275–284
30 Brand M. Voice puppetry. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques. 1999, 21–28
31 Xie L, Liu Z Q. Realistic mouth-synching for speech-driven talking face using articulatory modelling. IEEE Transactions on Multimedia, 2007, 9(3): 500–510
32 Wang L, Han W, Soong F K, Huo Q. Text-driven 3D photo-realistic talking head. In: Proceedings of Interspeech. 2011, 3307–3308
33 Zhang X, Wang L, Li G, Seide F, Soong F K. A new language independent, photo-realistic talking head driven by voice only. In: Proceedings of Interspeech. 2013, 2743–2747
34 Shimba T, Sakurai R, Yamazoe H, Lee J H. Talking heads synthesis from audio with deep neural networks. In: Proceedings of the IEEE/SICE International Symposium on System Integration. 2015, 100–105
35 Fan B, Xie L, Yang S, Wang L, Soong F K. A deep bidirectional LSTM approach for video-realistic talking head. Multimedia Tools and Applications, 2016, 75(9): 5287–5309
36 Eskimez S E, Maddox R K, Xu C, Duan Z. Generating talking face landmarks from speech. In: Proceedings of the International Conference on Latent Variable Analysis and Signal Separation. 2018, 372–381
37 Aneja D, Li W. Real-time lip sync for live 2D animation. arXiv preprint arXiv:1910.08685, 2019
38 Greenwood D, Matthews I, Laycock S. Joint learning of facial expression and head pose from speech. In: Proceedings of Interspeech. 2018, 2484–2488
39 Websdale D, Taylor S, Milner B. The effect of real-time constraints on automatic speech animation. In: Proceedings of Interspeech. 2018, 2479–2483
40 Schwartz J L, Savariaux C. No, there is no 150 ms lead of visual speech on auditory speech, but a range of audiovisual asynchronies varying from small audio lead to large audio lag. PLOS Computational Biology, 2014, 10(7): e1003743
41 Shen J, Pang R, Weiss R J, Schuster M, Jaitly N, Yang Z, Chen Z, Zhang Y, Wang Y, Skerry-Ryan R, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2018, 4779–4783
42 Prenger R, Valle R, Catanzaro B. WaveGlow: a flow-based generative network for speech synthesis. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2019, 3617–3621
43 Vougioukas K, Petridis S, Pantic M. Realistic speech-driven facial animation with GANs. International Journal of Computer Vision, 2020, 128(5): 1398–1413
44 Chen L, Maddox R K, Duan Z, Xu C. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 7832–7841
45 Abdel-Hamid O, Mohamed A R, Jiang H, Deng L, Penn G, Yu D. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(10): 1533–1545
46 Sainath T N, Li B. Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks. In: Proceedings of Interspeech. 2016, 813–817
47 Liu Y, Wang D. Time and frequency domain long short-term memory for noise robust pitch tracking. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2017, 5600–5604
48 Denil M, Bazzani L, Larochelle H, de Freitas N. Learning where to attend with deep architectures for image tracking. Neural Computation, 2012, 24(8): 2151–2184
49 Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014
50 Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A. Automatic differentiation in PyTorch. In: Proceedings of the Neural Information Processing Systems 2017 Workshop on Autodiff. 2017
51 Kingma D P, Ba J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014
52 Ekman P, Friesen W V, Hager J C. Facial Action Coding System: The Manual on CD-ROM. Instructor's Guide. Salt Lake City: Network Information Research Co., 2002
53 Mori M, MacDorman K F, Kageki N. The uncanny valley. IEEE Robotics and Automation Magazine, 2012, 19(2): 98–100
54 Kim C, Shin H V, Oh T H, Kaspar A, Elgharib M, Matusik W. On learning associations of faces and voices. In: Proceedings of the Asian Conference on Computer Vision. 2018, 276–292
55 Vielzeuf V, Kervadec C, Pateux S, Lechervy A, Jurie F. An Occam's razor view on learning audiovisual emotion recognition with small training sets. In: Proceedings of the 20th ACM International Conference on Multimodal Interaction. 2018, 589–593
56 Avots E, Sapiński T, Bachmann M, Kamińska D. Audiovisual emotion recognition in wild. Machine Vision and Applications, 2019, 30(5): 975–985
57 Oh T H, Dekel T, Kim C, Mosseri I, Freeman W T, Rubinstein M, Matusik W. Speech2Face: learning the face behind a voice. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 7539–7548
58 Wang R, Liu X, Cheung Y M, Cheng K, Wang N, Fan W. Learning discriminative joint embeddings for efficient face and voice association. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2020, 1881–1884
59 Zhu H, Luo M, Wang R, Zheng A, He R. Deep audio-visual learning: a survey. arXiv preprint arXiv:2001.04758, 2020
60 Ginosar S, Bar A, Kohavi G, Chan C, Owens A, Malik J. Learning individual styles of conversational gesture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 3497–3506