Past review, current progress, and challenges ahead on the cocktail party problem

doi:10.1631/FITEE.1700814

Frontiers of Information Technology & Electronic Engineering

2018, Vol. 19, Issue 1: 40-63. https://doi.org/10.1631/FITEE.1700814

Past review, current progress, and challenges ahead on the cocktail party problem

Yan-min QIAN¹, Chao WENG¹, Xuan-kai CHANG², Shuai WANG², Dong YU¹

¹. Tencent AI Lab, Tencent, Bellevue 98004, USA
². Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract:

The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of automatic speech recognition (ASR) systems. In this overview paper, we review the techniques proposed over the last two decades to attack this problem. We focus our discussion on the speech separation problem, given its central role in the cocktail party environment, and describe the conventional single-channel techniques, such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF), and generative models; the conventional multi-channel techniques, such as beamforming and multi-channel blind source separation; and the newly developed deep learning-based techniques, such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that it is important to effectively exploit information from the microphone array, the acoustic training set, and the language itself using a more powerful model. Better optimization objectives and techniques will be the path to solving the cocktail party problem.

Key words: Cocktail party problem; Computational auditory scene analysis; Non-negative matrix factorization; Permutation invariant training; Multi-talker speech processing

Received: 2017-12-08; Published: 2018-04-23

Corresponding Author(s): Yan-min QIAN, E-mail: yanminqian@tencent.com

Cite this article:

Yan-min QIAN, Chao WENG, Xuan-kai CHANG, Shuai WANG, Dong YU. Past review, current progress, and challenges ahead on the cocktail party problem. Front. Inform. Technol. Electron. Eng, 2018, 19(1): 40-63.

Link to this article:

https://academic.hep.com.cn/fitee/CN/10.1631/FITEE.1700814
https://academic.hep.com.cn/fitee/CN/Y2018/V19/I1/40
