Frontiers of Computer Science

ISSN 2095-2228

ISSN 2095-2236(Online)

CN 10-1014/TP

Postal Subscription Code 80-970

2018 Impact Factor: 1.129

Front. Comput. Sci.    2024, Vol. 18 Issue (6) : 186353    https://doi.org/10.1007/s11704-024-3787-8
Artificial Intelligence
Audio-guided self-supervised learning for disentangled visual speech representations
Dalu FENG1,2, Shuang YANG1,2, Shiguang SHAN1,2, Xilin CHEN1,2
1. Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 100049, China
Corresponding Author(s): Shuang YANG   
Just Accepted Date: 10 May 2024   Issue Date: 21 June 2024
 Cite this article:   
Dalu FENG,Shuang YANG,Shiguang SHAN, et al. Audio-guided self-supervised learning for disentangled visual speech representations[J]. Front. Comput. Sci., 2024, 18(6): 186353.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-024-3787-8
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I6/186353
Fig.1  The illustration of our proposed two-branch model for disentangled visual speech representation learning
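Year | Method | Accuracy/%
2020 | Ma et al. [4] (Ensemble) | 88.6
2021 | Ma et al. [5] | 88.4
2022 | Koumparoulis et al. [6] | 89.5
(a) | Ours Baseline $M_{\mathrm{original}}$ | 84.5
(b) | Speech-relevant deformation flow $F_{\mathrm{adj}}^{S}$ learned w/o $B^{\bar{S}}$ ($M_{F_{\mathrm{adj}}^{S}}$) | 89.5
(c) | Speech-relevant deformation flow $F_{\mathrm{adj}}^{S}$ learned w/ $B^{\bar{S}}$ ($M_{F_{\mathrm{adj}}^{S}}$) | 91.4
(d) | Flow-distilled VSR model ($M_{\mathrm{original}}$) | 85.5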
Tab.1  Comparison with the state-of-the-art methods on LRW
Year | Method | Training data | Total hours/h | WER/%
2021 | Ma et al. [7] | LRW, LRS2, LRS3 | 804 | 49.2
2022 | Ma et al. [8] | LRW, LRS3, AVSpeech | 1495 | 25.5
2023 | Ma et al. [9] | LRW, LRS2, LRS3, VoxCeleb2, AVSpeech | 3448 | 14.6
2021 | Ma et al. [7] (Reproduce) | LRS2 | 195 | 49.9
(a) | Ours baseline $M_{\mathrm{original}}$ | LRS2 | 195 | 44.8
(b) | Flow-based VSR $M_{F_{\mathrm{adj}}^{S}}$ | | | 22.1
(c) | Flow-distilled VSR $M_{\mathrm{original}}$ | | | 41.8
Tab.2  Comparison with the state-of-the-art methods on LRS2
1 Shi B, Hsu W N, Lakhotia K, Mohamed A. Learning audio-visual speech representation by masked multimodal cluster prediction. In: Proceedings of the 10th International Conference on Learning Representations. 2022
2 Hsu W N, Shi B. u-HuBERT: unified mixed-modal speech pretraining and zero-shot transfer to unlabeled modality. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1538
3 Stafylakis T, Tzimiropoulos G. Combining residual networks with LSTMs for lipreading. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association. 2017, 3652–3656
4 Ma P, Martinez B, Petridis S, Pantic M. Towards practical lipreading with distilled and efficient models. In: Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 2021, 7608–7612
5 Ma P, Wang Y, Shen J, Petridis S, Pantic M. Lip-reading with densely connected temporal convolutional networks. In: Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. 2021, 2856–2865
6 Koumparoulis A, Potamianos G. Accurate and resource-efficient lipreading with EfficientNetV2 and transformers. In: Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. 2022, 8467–8471
7 Ma P, Petridis S, Pantic M. End-to-end audio-visual speech recognition with conformers. In: Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 2021, 7613–7617
8 Ma P, Petridis S, Pantic M. Visual speech recognition for multiple languages in the wild. Nature Machine Intelligence, 2022, 4(11): 930–939
9 Ma P, Haliassos A, Fernandez-Lopez A, Chen H, Petridis S, Pantic M. Auto-AVSR: audio-visual speech recognition with automatic labels. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2023, 1–5
10 Yang Y, Zhuang Y, Pan Y. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 2021, 22(12): 1551–1558
[1] FCS-23787-OF-DF_suppl_1 Download
[2] FCS-23787-OF-DF_suppl_2 Download