Frontiers of Computer Science

Front. Comput. Sci.    2023, Vol. 17 Issue (6) : 176344    https://doi.org/10.1007/s11704-023-2230-x
Artificial Intelligence
Fine-grained sequence-to-sequence lip reading based on self-attention and self-distillation
Junxiao XUE1, Shibo HUANG2, Huawei SONG3, Lei SHI3
1. Research Institute of Artificial Intelligence, Zhejiang Lab, Hangzhou 311121, China
2. College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
3. School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002, China
Corresponding Author(s): Shibo HUANG   
Just Accepted Date: 10 February 2023   Issue Date: 30 March 2023
 Cite this article:   
Junxiao XUE, Shibo HUANG, Huawei SONG, et al. Fine-grained sequence-to-sequence lip reading based on self-attention and self-distillation[J]. Front. Comput. Sci., 2023, 17(6): 176344.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2230-x
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I6/176344
Fig.1  Framework of seq2seq lip reading based on self-attention and self-distillation
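This page does not include the authors' code, but the pipeline in Fig.1 can be read as a visual front-end followed by a self-attention encoder with a deep (teacher) branch and a shallower (student) branch used for self-distillation. The sketch below is a hypothetical PyTorch illustration of that structure only; all module choices, layer sizes, and names (e.g., LipReadingSketch, d_model) are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the Fig.1 pipeline (not the authors' code): a 3D-conv
# visual front-end, a self-attention encoder (deep branch), and a shallower
# encoder branch whose output can later be self-distilled from the deep branch.
import torch
import torch.nn as nn

class LipReadingSketch(nn.Module):
    def __init__(self, vocab_size=40, d_model=256, n_heads=4, n_layers=6):
        super().__init__()
        # Visual front-end over (B, 1, T, H, W) mouth-region crops
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse spatial dims, keep time
        )
        self.proj = nn.Linear(64, d_model)
        # Self-attention encoders: a deep (teacher) branch and a shallower
        # (student) branch with its own parameters for self-distillation.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.shallow = nn.TransformerEncoder(layer, num_layers=n_layers // 2)
        self.head_deep = nn.Linear(d_model, vocab_size)
        self.head_shallow = nn.Linear(d_model, vocab_size)

    def forward(self, frames):                                 # frames: (B, 1, T, H, W)
        feats = self.frontend(frames)                          # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        feats = self.proj(feats)                               # (B, T, d_model)
        deep = self.head_deep(self.encoder(feats))             # teacher-branch logits
        shallow = self.head_shallow(self.shallow(feats))       # student-branch logits
        return deep, shallow
```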
Dataset     Method                    CER/%     WER/%
GRID        λ=0                       0.579     2.357
            λ=0.1                     0.552     2.236
            λ=0.3                     0.541     2.163
            λ=0.5                     0.544     2.201
            + SE Blocks (λ=0.3)       0.458     1.846
            + Resformer (λ=0.3)       0.459     1.774
LRW         λ=0                       17.442    22.696
            λ=0.1                     15.478    20.276
            λ=0.3                     12.400    16.444
            λ=0.5                     12.783    17.200
            + SE Blocks (λ=0.3)       11.807    15.780
            + Resformer (λ=0.3)       11.023    14.752
LRW-1000    λ=0                       53.918    65.048
            λ=0.1                     49.829    59.413
            λ=0.3                     47.812    57.650
            λ=0.5                     51.130    61.929
            + SE Blocks (λ=0.3)       45.944    55.771
            + Resformer (λ=0.3)       44.381    54.603
Tab.1  Results of the ablation experiments on the GRID, LRW, and LRW-1000 datasets
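The λ swept in Tab.1 weights the self-distillation term against the recognition loss; λ=0 corresponds to training without distillation, and λ=0.3 gives the best results on all three datasets. A plausible form of such an objective is sketched below; the exact loss terms, weighting scheme, and temperature used in the paper may differ, so treat this purely as an illustration of λ's role.

```python
import torch.nn.functional as F

def self_distillation_loss(deep_logits, shallow_logits, targets, lam=0.3, tau=1.0):
    """deep_logits, shallow_logits: (B, T, V) branch outputs; targets: (B, T) token ids."""
    vocab = deep_logits.size(-1)
    # Supervised term: cross-entropy of the deep branch against the ground truth.
    ce = F.cross_entropy(deep_logits.reshape(-1, vocab), targets.reshape(-1))
    # Self-distillation term: KL divergence pulling the shallow (student) branch
    # toward the detached deep (teacher) branch, with optional temperature tau.
    kd = F.kl_div(
        F.log_softmax(shallow_logits.reshape(-1, vocab) / tau, dim=-1),
        F.softmax(deep_logits.detach().reshape(-1, vocab) / tau, dim=-1),
        reduction="batchmean",
    ) * (tau * tau)
    # lam plays the role of the table's λ: it trades supervision off against distillation.
    return (1.0 - lam) * ce + lam * kd
```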
Dataset     Method               CER/%    WER/%
GRID        LipNet [2]           1.9      4.8
            WAS [3]              –        3.0
            LCANet [4]           1.3      2.9
            Face (Cutout) [5]    1.2      2.9
            Ours                 0.46     1.77
Tab.2  Results of the sentence-level experiments on the GRID dataset (–: not reported)
Dataset     Method       CER/%    WER/%    Acc/%
LRW         PCPG [6]     14.1     22.7     77.3
            STFM [7]     –        16.3     83.7
            Ours         11.02    14.75    85.25
LRW-1000    PCPG [6]     51.3     66.9     33.1
            Ours         44.38    54.60    45.40
Tab.3  Comparison of sentence-level experiments on the LRW and LRW-1000 datasets (–: not reported)
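The CER and WER reported in Tabs.1–3 are the standard character- and word-level edit-distance rates (edit operations divided by reference length, in percent). The snippet below is a minimal reference implementation of these metrics, not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (free if equal)
    return dp[-1]

def error_rate(refs, hyps, level="char"):
    """CER (level='char') or WER (level='word') over paired reference/hypothesis strings, in percent."""
    split = (lambda s: list(s)) if level == "char" else (lambda s: s.split())
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return 100.0 * errors / max(total, 1)
```

For example, error_rate(["place blue at f two now"], ["place blue at f too now"], level="word") returns 16.67 (one substituted word out of six), while the same pair at level="char" yields a much lower CER, which is why CER is consistently below WER in the tables above.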
1 Xiao J, Yang S, Zhang Y, Shan S, Chen X. Deformation flow based two-stream network for lip reading. In: Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020, 364–370
2 Assael Y M, Shillingford B, Whiteson S, De Freitas N. LipNet: end-to-end sentence-level lipreading. 2017, arXiv preprint arXiv: 1611.01599
3 Chung J S, Senior A, Vinyals O, et al. Lip reading sentences in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, 3444–3453
4 Xu K, Li D, Cassimatis N, Wang X. LCANet: end-to-end lipreading with cascaded attention-CTC. In: Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018, 548–555
5 Zhang Y, Yang S, Xiao J, et al. Can we read speech beyond the lips? Rethinking ROI selection for deep visual speech recognition. In: Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020, 356–363
6 Luo M, Yang S, Shan S, Chen X. Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading. In: Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020, 273–280
7 Zhang X, Cheng F, Wang S. Spatio-temporal fusion based convolutional sequence learning for lip reading. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019, 713–722
Supplementary material: FCS-22230-OF-JX_suppl_1