A Novel Way to Start Speech Dialogs in Cars by Talk-and-Push (TAP) - Digital Signal Processing for In-Vehicle Systems and Safety

the PTS button has been pressed. Similarly, even experienced users may not always

conform to the required sequence simply out of impatience or because they are

concentrating on the driving task. As a consequence, the portion of speech uttered

prematurely will not be processed by the system, resulting in recognition errors.

Another source of degradation is acoustic leaking of music or speech being

presented via the car audio system into the hands-free microphone. Since the

automatic speech recognition (ASR) engine generally cannot distinguish such

signal components from the user's voice commands, the result will be recognition

errors. In many commercial systems, this problem is approached by muting the

loudspeakers upon PTS button actuation. However, muting cannot be performed

instantaneously, thus leaving some disturbances in the microphone signal. Moreover,

it is not always advisable to mute the loudspeaker signal. For example, the car

computer may need to deliver urgent voice notifications at any time, regardless of

whether the system is engaged in a speech dialog.

Instead of muting, some state-of-the-art systems employ acoustic echo cancellation

(AEC) methods [1, 2], which strive to estimate and remove the acoustic signal

component captured by the hands-free microphone originating from the car

loudspeakers. While AEC makes muting unnecessary, this method alone still

does not provide for intuitive dialog initiation. An extended and more flexible

solution, the so-called talk-and-push (TAP) system, has been proposed in [3].

It allows the user to start speaking within a certain time frame before or after

PTS button actuation. This is achieved by employing a look-back speech buffer in

conjunction with an AEC unit and robust SOU detection. The experiments in [3]

were conducted at a sampling frequency of 8 kHz and using the normalized least-

mean-square (NLMS) algorithm for AEC.
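
The look-back idea described above can be illustrated with a small ring buffer. The following is only a minimal sketch, not the implementation from [3]: the class name LookBackBuffer, its parameters (lookback_s, frame_len, fs), and the choice to store whole frames are assumptions made for illustration. In a TAP system the buffered frames would presumably be the echo-compensated microphone signal, and the snapshot taken at PTS actuation would be passed on to SOU detection.

```python
from collections import deque


class LookBackBuffer:
    """Keeps only the most recent `lookback_s` seconds of audio frames.

    All names and default values here are hypothetical; the source
    does not specify the buffer length or the frame size.
    """

    def __init__(self, lookback_s=1.0, frame_len=160, fs=8000):
        # Number of whole frames that cover the look-back window.
        self.max_frames = max(1, int(round(lookback_s * fs / frame_len)))
        # A deque with maxlen silently drops the oldest frame when full.
        self.frames = deque(maxlen=self.max_frames)

    def push(self, frame):
        """Store one frame of (ideally echo-compensated) samples."""
        self.frames.append(list(frame))

    def snapshot(self):
        """Return the buffered samples, oldest first, as one flat list."""
        return [sample for frame in self.frames for sample in frame]


# Usage: frames are pushed continuously; at button actuation the
# snapshot provides the speech uttered shortly before the press.
buf = LookBackBuffer(lookback_s=0.5, frame_len=4, fs=16)  # 2 frames
for k in range(5):
    buf.push([k] * 4)
look_back_audio = buf.snapshot()  # contents of the last two frames
```

Using a bounded deque keeps memory constant regardless of how long the system idles before the button is pressed.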

In this chapter, we investigate the performance of a TAP system operating at

16 kHz sampling frequency and employing the frequency-domain adaptive filter

(FDAF) as proposed in [4] for AEC. While the higher sampling rate was chosen to

open the prospect of more complex ASR tasks, the FDAF offers lower computational

complexity than a 16 kHz NLMS algorithm, as well as a built-in postfilter for

residual echo suppression.
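
For orientation, the time-domain NLMS baseline mentioned above can be sketched in a few lines. This is a textbook NLMS update, not the FDAF of [4] and not necessarily the exact variant used in [3]; the function name nlms_echo_cancel and the values of num_taps, mu, and eps are assumptions chosen for illustration. The per-sample cost grows with the filter length, which is one reason a frequency-domain filter may be attractive at 16 kHz.

```python
def nlms_echo_cancel(x, y, num_taps=8, mu=0.5, eps=1e-8):
    """Textbook NLMS echo canceller (illustrative sketch only).

    x: loudspeaker reference samples
    y: microphone samples containing the echo of x
    Returns (e, w): the echo-compensated signal and the final weights.
    num_taps, mu, and eps are hypothetical values, not from the source.
    """
    w = [0.0] * num_taps      # adaptive estimate of the echo path
    xbuf = [0.0] * num_taps   # reference history, newest sample first
    e = []
    for xn, yn in zip(x, y):
        xbuf = [xn] + xbuf[:-1]
        # Echo estimate: inner product of weights and reference history.
        d_hat = sum(wi * xi for wi, xi in zip(w, xbuf))
        err = yn - d_hat
        # Step size normalized by the reference energy (plus eps).
        gain = mu * err / (sum(xi * xi for xi in xbuf) + eps)
        w = [wi + gain * xi for wi, xi in zip(w, xbuf)]
        e.append(err)
    return e, w
```

Each sample requires on the order of 2 * num_taps multiplications, so doubling the sampling rate while covering the same echo-path duration roughly quadruples the cost per second of audio.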

The remainder of this chapter is organized as follows: Section 7.2 outlines the

TAP system architecture. The implementation of the system components (AEC,

noise reduction, and SOU detection) is described in Sects. 7.3 and 7.4. Section 7.5

then summarizes the experimental setup, followed by a discussion of the simulation

results in Sect. 7.6.

7.2 The Talk-and-Push System

We assume the typical setup of an in-car speech dialog system: It consists of

a speaker (e.g., the driver) seated in a vehicle, a hands-free microphone for voice

control, and an in-car loudspeaker system reproducing voice prompts or music from

the FM radio. In the microphone, the speaker's speech signal s is disturbed by

additive background noise n and the reverberated loudspeaker signal d. In the
