[2207.07296] Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments
This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset that helps a user follow conversations held in real noisy reverberant environments (e.g., a cocktail party). One may use a state-of-the-art blind source separation method called fast multichannel nonnegative matrix factorization (FastMNMF), which works well in various environments thanks to its unsupervised nature. Its heavy computational cost, however, prevents its application to real-time processing.

In contrast, a supervised beamforming method that uses a deep neural network (DNN) to estimate the spatial information of speech and noise is well suited to real-time processing, but suffers from drastic performance degradation under mismatched conditions.
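To make the front end concrete, here is a minimal sketch (not the authors' implementation) of DNN-based mask-driven MVDR beamforming: a recurrent network estimates speech and noise time-frequency masks from log-magnitude spectra and a direction feature, the masks weight the spatial covariance matrices, and Souden-style MVDR weights are computed per frequency bin. The `MaskNet` architecture, the 72-bin direction encoding, and all function names are assumptions for illustration.

```python
import numpy as np
import torch

class MaskNet(torch.nn.Module):
    """Hypothetical direction-aware mask estimator (architecture assumed)."""

    def __init__(self, n_freq, n_dir=72, hidden=256):
        super().__init__()
        # Input: log-magnitude spectrum concatenated with a direction
        # feature (e.g., a one-hot encoding of the target direction).
        self.rnn = torch.nn.LSTM(n_freq + n_dir, hidden, batch_first=True)
        self.out = torch.nn.Linear(hidden, 2 * n_freq)

    def forward(self, logmag, dir_feat):
        h, _ = self.rnn(torch.cat([logmag, dir_feat], dim=-1))
        masks = torch.sigmoid(self.out(h))
        return masks.chunk(2, dim=-1)  # (speech_mask, noise_mask)

def mvdr_enhance(X, speech_mask, noise_mask, ref_mic=0, eps=1e-8):
    """Mask-based MVDR beamforming.

    X: (F, T, M) multichannel STFT; masks: (F, T) arrays in [0, 1].
    Returns the enhanced single-channel STFT of shape (F, T).
    """
    # Mask-weighted spatial covariance matrices, shape (F, M, M).
    R_s = np.einsum('ft,ftm,ftn->fmn', speech_mask, X, X.conj())
    R_n = np.einsum('ft,ftm,ftn->fmn', noise_mask, X, X.conj())
    F, _, M = X.shape
    Y = np.zeros(X.shape[:2], dtype=X.dtype)
    for f in range(F):
        # Souden-style MVDR: w = (R_n^{-1} R_s) e_ref / tr(R_n^{-1} R_s).
        num = np.linalg.solve(R_n[f] + eps * np.eye(M), R_s[f])
        w = num[:, ref_mic] / (np.trace(num) + eps)
        Y[f] = X[f] @ w.conj()
    return Y
```

The beamforming itself is cheap linear algebra; all scene dependence is concentrated in the mask estimator, which is exactly the component the proposed method adapts at run time.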
Given such complementary characteristics, we propose a dual-process robust online speech enhancement method based on DNN-based beamforming with FastMNMF-guided adaptation. The back end (FastMNMF) runs in a mini-batch style, and the resulting pairs of noisy and enhanced speech are used, together with the original parallel training data, to update the direction-aware DNN front end by backpropagation at a computationally allowable interval.
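The sketch below illustrates one such adaptation round under stated assumptions: the blind back end is treated as a black-box `blind_enhance` callable (mini-batch FastMNMF in the paper), its outputs become pseudo-targets paired with the buffered noisy segments, and those pairs are mixed with the original supervised data for a bounded number of gradient steps. The names `front_end`, `blind_enhance`, and `loss_fn`, and the fixed-length-segment assumption, are illustrative rather than the paper's API.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def adaptation_round(front_end, optimizer, noisy_buffer, parallel_set,
                     blind_enhance, loss_fn, steps=100, batch_size=8):
    """One blind-back-end-guided adaptation round, run at a fixed interval.

    front_end:     DNN mapping noisy features to an enhancement target.
    noisy_buffer:  list of recently observed fixed-length noisy segments.
    parallel_set:  original (noisy, clean) supervised training pairs.
    blind_enhance: black-box stand-in for the mini-batch blind back end.
    """
    # 1) Pseudo-label the buffered observations with the blind back end.
    with torch.no_grad():
        targets = [blind_enhance(x) for x in noisy_buffer]
    pseudo_set = TensorDataset(torch.stack(noisy_buffer), torch.stack(targets))

    # 2) Mix the pseudo pairs with the original parallel data so the
    #    front end adapts to the current scene without forgetting.
    loader = DataLoader(ConcatDataset([pseudo_set, parallel_set]),
                        batch_size=batch_size, shuffle=True)

    # 3) A bounded number of backpropagation steps keeps the adaptation
    #    within a computationally allowable budget.
    batches = iter(loader)
    for _ in range(steps):
        try:
            noisy, target = next(batches)
        except StopIteration:
            batches = iter(loader)
            noisy, target = next(batches)
        optimizer.zero_grad()
        loss = loss_fn(front_end(noisy), target)
        loss.backward()
        optimizer.step()
```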
dereverberation method called weighted prediction error (WPE) for transcribing
the noisy reverberant speech of a speaker, which can be detected from video or
selected by a user’s hand gesture or eye gaze, in a streaming manner and
spatially showing the transcriptions with an AR technique. Our experiment
showed that the word error rate was improved by more than 10 points with the
run-time adaptation using only twelve minutes of observation.
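For reference, here is a minimal offline sketch of the iteratively reweighted multichannel linear prediction at the core of WPE, not the streaming variant such a system would actually run; in practice an off-the-shelf package such as nara_wpe is a natural choice. The tap, delay, and iteration settings are illustrative.

```python
import numpy as np

def wpe_dereverb(Y, taps=10, delay=3, iters=3, eps=1e-8):
    """Offline WPE: subtract late reverberation predicted by a
    per-frequency multichannel linear filter.

    Y: (F, M, T) multichannel STFT. Returns a dereverberated copy.
    """
    F, M, T = Y.shape
    X = Y.copy()
    for f in range(F):
        # Delayed, stacked observations, shape (M * taps, T).
        Yt = np.zeros((M * taps, T), dtype=Y.dtype)
        for k in range(taps):
            shift = delay + k
            Yt[k * M:(k + 1) * M, shift:] = Y[f, :, :T - shift]
        for _ in range(iters):
            # Time-varying power of the current estimate (mic average).
            lam = np.maximum(np.mean(np.abs(X[f]) ** 2, axis=0), eps)
            Yw = Yt / lam                        # weight frames by 1/lambda(t)
            R = Yw @ Yt.conj().T                 # weighted correlation matrix
            P = Yw @ Y[f].conj().T               # weighted cross-correlation
            G = np.linalg.solve(R + eps * np.eye(M * taps), P)
            X[f] = Y[f] - G.conj().T @ Yt        # remove predicted tail
    return X
```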