Frontiers of Electrical and Electronic Engineering

ISSN 2095-2732

ISSN 2095-2740(Online)

CN 10-1028/TM

Front. Electr. Electron. Eng.    2006, Vol. 1 Issue (4) : 425-430    https://doi.org/10.1007/s11460-006-0081-5
Audio-visual voice activity detection
LIU Peng, WANG Zuo-ying
Department of Electronic Engineering, Tsinghua University, Beijing 100084, China;
Abstract In speech signal processing systems, frame-energy based voice activity detection (VAD) can be degraded by background noise and by the non-stationary behavior of the frame energy within voice segments. The purpose of this paper is to improve the performance and robustness of VAD by introducing visual information. A data-driven linear transformation is adopted for visual feature extraction, and a general statistical VAD model is designed. Using the general model and a two-stage fusion strategy presented in this paper, a concrete multimodal VAD system is built. Experiments show that, compared with frame-energy based audio VAD, multimodal VAD yields a 55.0% relative reduction in frame error rate and a 98.5% relative reduction in sentence-breaking error rate. With the multimodal method, sentence-breaking errors are almost entirely avoided and frame-detection performance is clearly improved, which demonstrates the effectiveness of the visual modality in VAD.
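For readers unfamiliar with the audio-only baseline named in the abstract, the following is a minimal sketch of a frame-energy based VAD together with the "relative reduction" metric. It is not the authors' implementation: the frame length, the fixed decibel threshold, and the function names `frame_energies`, `energy_vad`, and `relative_reduction` are all assumptions made for illustration, and the paper's statistical model and two-stage fusion are not reproduced here.

```python
import math


def frame_energies(samples, frame_len):
    """Return the log energy (dB) of each non-overlapping frame.

    Trailing samples that do not fill a whole frame are ignored.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        # Small floor avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(mean_square + 1e-12))
    return energies


def energy_vad(samples, frame_len=160, threshold_db=-30.0):
    """Mark a frame as speech when its log energy exceeds a fixed threshold.

    Both default values are hypothetical; a fixed threshold is exactly the
    kind of rule that background noise can defeat.
    """
    return [e > threshold_db for e in frame_energies(samples, frame_len)]


def relative_reduction(baseline_error, new_error):
    """Relative error reduction: (baseline - new) / baseline."""
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Synthetic signal: two silent frames followed by two frames of a tone.
    silence = [0.0] * 320
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
    print(energy_vad(silence + tone))
    # Hypothetical error rates, chosen only to illustrate the formula.
    print(relative_reduction(0.20, 0.09))
```

A relative reduction of 55.0% therefore means the multimodal frame error rate is 45% of the audio-only rate; the abstract does not state the absolute error rates.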
Issue Date: 05 December 2006
 Cite this article:   
WANG Zuo-ying,LIU Peng. Audio-visual voice activity detection[J]. Front. Electr. Electron. Eng., 2006, 1(4): 425-430.
 URL:  
https://academic.hep.com.cn/fee/EN/10.1007/s11460-006-0081-5
https://academic.hep.com.cn/fee/EN/Y2006/V1/I4/425