Audio-guided self-supervised learning for disentangled visual speech representations

doi:10.1007/s11704-024-3787-8

Front. Comput. Sci., 2024, Vol. 18, Issue 6: 186353. https://doi.org/10.1007/s11704-024-3787-8

Artificial Intelligence


Dalu FENG¹,², Shuang YANG¹,², Shiguang SHAN¹,², Xilin CHEN¹,²

¹. Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
². University of Chinese Academy of Sciences, Beijing 100049, China


Corresponding Author(s): Shuang YANG

Just Accepted Date: 10 May 2024; Issue Date: 21 June 2024

Cite this article:

Dalu FENG, Shuang YANG, Shiguang SHAN, et al. Audio-guided self-supervised learning for disentangled visual speech representations[J]. Front. Comput. Sci., 2024, 18(6): 186353.

URL:

https://academic.hep.com.cn/fcs/EN/10.1007/s11704-024-3787-8
https://academic.hep.com.cn/fcs/EN/Y2024/V18/I6/186353

Fig.1 Illustration of our proposed two-branch model for disentangled visual speech representation learning

Year	Method	Accuracy/% $\uparrow$
2020	Ma et al. [4] (Ensemble)	88.6
2021	Ma et al. [5]	88.4
2022	Koumparoulis et al. [6]	89.5
(a)	Ours baseline $M_{original}$	84.5
(b)	Speech-relevant deformation flow $F_{adj}^{S'}$ learned w/o $B_{\bar{S}}$ ($M_{F_{adj}^{S'}}$)	89.5
(c)	Speech-relevant deformation flow $F_{adj}^{S}$ learned w/ $B_{\bar{S}}$ ($M_{F_{adj}^{S}}$)	91.4
(d)	Flow-distilled VSR model ($M_{original}'$)	85.5

Tab.1 Comparison with the state-of-the-art methods on LRW

Year	Method	Training data	Total hours/h	WER/% $\downarrow$
2021	Ma et al. [7]	LRW, LRS2, LRS3	804	49.2
2022	Ma et al. [8]	LRW, LRS3, AVSpeech	1495	25.5
2023	Ma et al. [9]	LRW, LRS2, LRS3, VoxCeleb2, AVSpeech	3448	14.6
2021	Ma et al. [7] (Reproduce)	LRS2	195	49.9
(a)	Ours baseline $M_{original}$	LRS2	195	44.8
(b)	Flow-based VSR $M_{F_{adj}^{S}}$			22.1
(c)	Flow-distilled VSR $M_{original}'$			41.8

Tab.2 Comparison with the state-of-the-art methods on LRS2

1	Shi B, Hsu W N, Lakhotia K, Mohamed A. Learning audio-visual speech representation by masked multimodal cluster prediction. In: Proceedings of the 10th International Conference on Learning Representations. 2022
2	Hsu W N, Shi B. u-HuBERT: unified mixed-modal speech pretraining and zero-shot transfer to unlabeled modality. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. 2022, 1538
3	Stafylakis T, Tzimiropoulos G. Combining residual networks with LSTMs for lipreading. In: Proceedings of the 18th Annual Conference of the International Speech Communication Association. 2017, 3652–3656
4	Ma P, Martinez B, Petridis S, Pantic M. Towards practical lipreading with distilled and efficient models. In: Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 2021, 7608–7612
5	Ma P, Wang Y, Shen J, Petridis S, Pantic M. Lip-reading with densely connected temporal convolutional networks. In: Proceedings of 2021 IEEE Winter Conference on Applications of Computer Vision. 2021, 2856–2865
6	Koumparoulis A, Potamianos G. Accurate and resource-efficient lipreading with EfficientNetV2 and transformers. In: Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. 2022, 8467–8471
7	Ma P, Petridis S, Pantic M. End-to-end audio-visual speech recognition with conformers. In: Proceedings of 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. 2021, 7613–7617
8	Ma P, Petridis S, Pantic M. Visual speech recognition for multiple languages in the wild. Nature Machine Intelligence, 2022, 4(11): 930–939
9	Ma P, Haliassos A, Fernandez-Lopez A, Chen H, Petridis S, Pantic M. Auto-AVSR: audio-visual speech recognition with automatic labels. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 2023, 1–5
10	Yang Y, Zhuang Y, Pan Y. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 2021, 22(12): 1551–1558

