Frontiers of Computer Science

Front. Comput. Sci.    2023, Vol. 17 Issue (6) : 176344    https://doi.org/10.1007/s11704-023-2230-x
Artificial Intelligence
Fine-grained sequence-to-sequence lip reading based on self-attention and self-distillation
Junxiao XUE1, Shibo HUANG2, Huawei SONG3, Lei SHI3
1. Research Institute of Artificial Intelligence, Zhejiang Lab, Hangzhou 311121, China
2. College of Life Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
3. School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002, China
Corresponding Author(s): Shibo HUANG   
Just Accepted Date: 10 February 2023   Issue Date: 30 March 2023
 Cite this article:   
Junxiao XUE, Shibo HUANG, Huawei SONG, et al. Fine-grained sequence-to-sequence lip reading based on self-attention and self-distillation[J]. Front. Comput. Sci., 2023, 17(6): 176344.
 URL:  
https://academic.hep.com.cn/fcs/EN/10.1007/s11704-023-2230-x
https://academic.hep.com.cn/fcs/EN/Y2023/V17/I6/176344
Fig.1  Framework of seq2seq lip reading based on self-attention and self-distillation
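This page does not include the authors' code, but the pipeline in Fig.1 can be read as a visual front-end followed by a self-attention encoder with a deep (teacher) branch and a shallower (student) branch used for self-distillation. The sketch below is a hypothetical PyTorch illustration of that structure only; all module choices, layer sizes, and names (e.g., LipReadingSketch, d_model) are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the Fig.1 pipeline (not the authors' code): a 3D-conv
# visual front-end, a self-attention encoder (deep branch), and a shallower
# encoder branch whose output can later be self-distilled from the deep branch.
import torch
import torch.nn as nn

class LipReadingSketch(nn.Module):
    def __init__(self, vocab_size=40, d_model=256, n_heads=4, n_layers=6):
        super().__init__()
        # Visual front-end over (B, 1, T, H, W) mouth-region crops
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse spatial dims, keep time
        )
        self.proj = nn.Linear(64, d_model)
        # Self-attention encoders: a deep (teacher) branch and a shallower
        # (student) branch with its own parameters for self-distillation.
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.shallow = nn.TransformerEncoder(layer, num_layers=n_layers // 2)
        self.head_deep = nn.Linear(d_model, vocab_size)
        self.head_shallow = nn.Linear(d_model, vocab_size)

    def forward(self, frames):                                 # frames: (B, 1, T, H, W)
        feats = self.frontend(frames)                          # (B, 64, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        feats = self.proj(feats)                               # (B, T, d_model)
        deep = self.head_deep(self.encoder(feats))             # teacher-branch logits
        shallow = self.head_shallow(self.shallow(feats))       # student-branch logits
        return deep, shallow
```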
Dataset     Method                    CER/%     WER/%
GRID        λ=0                       0.579     2.357
            λ=0.1                     0.552     2.236
            λ=0.3                     0.541     2.163
            λ=0.5                     0.544     2.201
            + SE Blocks (λ=0.3)       0.458     1.846
            + Resformer (λ=0.3)       0.459     1.774
LRW         λ=0                       17.442    22.696
            λ=0.1                     15.478    20.276
            λ=0.3                     12.400    16.444
            λ=0.5                     12.783    17.200
            + SE Blocks (λ=0.3)       11.807    15.780
            + Resformer (λ=0.3)       11.023    14.752
LRW-1000    λ=0                       53.918    65.048
            λ=0.1                     49.829    59.413
            λ=0.3                     47.812    57.650
            λ=0.5                     51.130    61.929
            + SE Blocks (λ=0.3)       45.944    55.771
            + Resformer (λ=0.3)       44.381    54.603
Tab.1  Results of the ablation experiments on the GRID, LRW, and LRW-1000 datasets
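The λ swept in Tab.1 weights the self-distillation term against the recognition loss; λ=0 corresponds to training without distillation, and λ=0.3 gives the best results on all three datasets. A plausible form of such an objective is sketched below; the exact loss terms, weighting scheme, and temperature used in the paper may differ, so treat this purely as an illustration of λ's role.

```python
import torch.nn.functional as F

def self_distillation_loss(deep_logits, shallow_logits, targets, lam=0.3, tau=1.0):
    """deep_logits, shallow_logits: (B, T, V) branch outputs; targets: (B, T) token ids."""
    vocab = deep_logits.size(-1)
    # Supervised term: cross-entropy of the deep branch against the ground truth.
    ce = F.cross_entropy(deep_logits.reshape(-1, vocab), targets.reshape(-1))
    # Self-distillation term: KL divergence pulling the shallow (student) branch
    # toward the detached deep (teacher) branch, with optional temperature tau.
    kd = F.kl_div(
        F.log_softmax(shallow_logits.reshape(-1, vocab) / tau, dim=-1),
        F.softmax(deep_logits.detach().reshape(-1, vocab) / tau, dim=-1),
        reduction="batchmean",
    ) * (tau * tau)
    # lam plays the role of the table's λ: it trades supervision off against distillation.
    return (1.0 - lam) * ce + lam * kd
```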
Dataset     Method               CER/%    WER/%
GRID        LipNet [2]           1.9      4.8
            WAS [3]              –        3.0
            LCANet [4]           1.3      2.9
            Face (Cutout) [5]    1.2      2.9
            Ours                 0.46     1.77
Tab.2  Results of the sentence-level experiments on the GRID dataset (–: not reported)
Dataset     Method       CER/%    WER/%    Acc/%
LRW         PCPG [6]     14.1     22.7     77.3
            STFM [7]     –        16.3     83.7
            Ours         11.02    14.75    85.25
LRW-1000    PCPG [6]     51.3     66.9     33.1
            Ours         44.38    54.60    45.40
Tab.3  Comparison of sentence-level experiments on the LRW and LRW-1000 datasets (–: not reported)
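The CER and WER reported in Tabs.1–3 are the standard character- and word-level edit-distance rates (edit operations divided by reference length, in percent). The snippet below is a minimal reference implementation of these metrics, not the authors' evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution (free if equal)
    return dp[-1]

def error_rate(refs, hyps, level="char"):
    """CER (level='char') or WER (level='word') over paired reference/hypothesis strings, in percent."""
    split = (lambda s: list(s)) if level == "char" else (lambda s: s.split())
    errors = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return 100.0 * errors / max(total, 1)
```

For example, error_rate(["place blue at f two now"], ["place blue at f too now"], level="word") returns 16.67 (one substituted word out of six), while the same pair at level="char" yields a much lower CER, which is why CER is consistently below WER in the tables above.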
1 Xiao J, Yang S, Zhang Y, Shan S, Chen X. Deformation flow based two-stream network for lip reading. In: Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020, 364–370
2 Assael Y M, Shillingford B, Whiteson S, De Freitas N. LipNet: end-to-end sentence-level lipreading. 2017, arXiv preprint arXiv: 1611.01599
3 Chung J S, Senior A, Vinyals O, et al. Lip reading sentences in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, 3444–3453
4 Xu K, Li D, Cassimatis N, Wang X. LCANet: end-to-end lipreading with cascaded attention-CTC. In: Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), 2018, 548–555
5 Zhang Y, Yang S, Xiao J, et al. Can we read speech beyond the lips? Rethinking ROI selection for deep visual speech recognition. In: Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020, 356–363
6 Luo M, Yang S, Shan S, Chen X. Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading. In: Proceedings of the 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020, 273–280
7 Zhang X, Cheng F, Wang S. Spatio-temporal fusion based convolutional sequence learning for lip reading. In: Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, 2019, 713–722
Supplementary material: FCS-22230-OF-JX_suppl_1