Digital Signal Processing Reference
the PTS button has been pressed. Similarly, even experienced users may not always
conform to the required sequence, simply out of impatience or because they are
concentrating on the driving task. As a consequence, the portion of speech uttered
prematurely will not be processed by the system, resulting in recognition errors.
Another source of degradation is acoustic leaking of music or speech being
presented via the car audio system into the hands-free microphone. Since the
automatic speech recognition (ASR) engine generally cannot distinguish such
signal components from the user's voice commands, the result will be recognition
errors. In many commercial systems, this problem is approached by muting the
loudspeakers upon PTS button actuation. However, muting cannot be performed
instantaneously, thus leaving some disturbances in the microphone signal. Moreover,
it is not always advisable to mute the loudspeaker signal. For example, the car
computer may need to deliver urgent voice notifications at any time, regardless of
whether the system is engaged in a speech dialog.
Instead of muting, some state-of-the-art systems employ acoustic echo cancelation
(AEC) methods [1, 2], which strive to estimate and remove the acoustic signal
component captured by the hands-free microphone originating from the car
loudspeakers. While AEC makes muting unnecessary, this method alone still
does not provide for intuitive dialog initiation. An extended and more flexible
solution, the so-called talk-and-push (TAP) system, has been proposed in [3].
It allows the user to start speaking within a certain time frame before or after
PTS button actuation. This is achieved by employing a look-back speech buffer in
conjunction with an AEC unit and a robust SOU detection. The experiments in [3]
were conducted at a sampling frequency of 8 kHz and using the normalized
least-mean-square (NLMS) algorithm for AEC.
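The look-back idea can be illustrated with a small ring buffer that always holds the most recent stretch of microphone audio, so that speech uttered shortly before button actuation is still available afterwards. The following is only a minimal sketch under stated assumptions, not the implementation from [3]: the class name `LookBackBuffer`, the parameters `look_back_s` and `frame_len`, and the default values are hypothetical choices made for illustration.

```python
from collections import deque

import numpy as np


class LookBackBuffer:
    """Ring buffer keeping roughly the last `look_back_s` seconds of audio.

    All names and default values here are illustrative assumptions.
    """

    def __init__(self, sample_rate=16000, look_back_s=1.0, frame_len=160):
        self.frame_len = frame_len
        n_frames = int(round(look_back_s * sample_rate / frame_len))
        # A deque with maxlen discards the oldest frame automatically.
        self._frames = deque(maxlen=n_frames)

    def push(self, frame):
        """Store one frame of (ideally echo-cancelled) microphone samples."""
        frame = np.asarray(frame, dtype=np.float64)
        if frame.shape != (self.frame_len,):
            raise ValueError("unexpected frame length")
        self._frames.append(frame)

    def snapshot(self):
        """Return the buffered audio as one array, oldest sample first."""
        if not self._frames:
            return np.zeros(0)
        return np.concatenate(list(self._frames))
```

In such a sketch, `push` would be called for every incoming frame regardless of the button state, and `snapshot` would be called at button actuation so that the SOU detection can search the buffered audio for a speech onset that precedes the button press.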
In this chapter, we investigate the performance of a TAP system operating at
16 kHz sampling frequency and employing the frequency-domain adaptive filter
(FDAF) as proposed in [4] for AEC. While the higher sampling rate was chosen to
open the prospect of more complex ASR tasks, the FDAF offers lower computational
complexity than a 16 kHz NLMS algorithm, as well as a built-in postfilter for
residual echo suppression.
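For orientation, the time-domain NLMS baseline against which the FDAF is compared can be sketched as follows. This is a generic textbook form of the NLMS update, not the FDAF of [4] and not the exact configuration used in [3]; the function name and the parameters `num_taps`, `mu`, and `eps` are assumptions chosen for illustration, as are the symbols `x` (loudspeaker signal) and `y` (microphone signal).

```python
import numpy as np


def nlms_echo_canceller(x, y, num_taps=256, mu=0.5, eps=1e-6):
    """Remove the echo of loudspeaker signal x from microphone signal y.

    Returns the error (echo-reduced) signal e and the final filter weights w.
    Parameter names and defaults are illustrative assumptions.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    w = np.zeros(num_taps)        # estimate of the echo path
    x_buf = np.zeros(num_taps)    # recent loudspeaker samples, newest first
    e = np.zeros(len(y))
    for k in range(len(y)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = x[k]
        d_hat = w @ x_buf         # estimated echo at time k
        e[k] = y[k] - d_hat       # microphone signal minus estimated echo
        # Normalized step: scale the update by the input energy in the buffer.
        w += mu * e[k] * x_buf / (x_buf @ x_buf + eps)
    return e, w
```

Each output sample costs on the order of `num_taps` multiply-adds for filtering and again for the update, and the number of taps needed to cover a given echo-path duration doubles when moving from 8 kHz to 16 kHz. This is the per-sample cost that a block-based frequency-domain filter is meant to reduce.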
The remainder of this chapter is organized as follows: Section 7.2 outlines the
TAP system architecture. The implementation of the system components (AEC,
noise reduction, and SOU detection) is described in Sects. 7.3 and 7.4. Section 7.5
then summarizes the experimental setup, followed by a discussion of the simulation
results in Sect. 7.6.
7.2 The Talk-and-Push System
We assume the typical setup of an in-car speech dialog system: it consists of
a speaker (e.g., the driver) seated in a vehicle, a hands-free microphone for voice
control, and an in-car loudspeaker system reproducing voice prompts or music from
the FM radio. In the microphone, the speaker's speech signal s is disturbed by
additive background noise n and the reverberated loudspeaker signal d. In the