A Novel Way to Start Speech Dialogs in Cars by Talk-and-Push (TAP) - Digital Signal Processing for In-Vehicle Systems and Safety


The resulting error signal $e(n)$ is processed in two different branches. As shown at the bottom of Fig. 7.1, it is stored in a circular buffer to be fed into the ASR engine without further processing. In the upper branch of the TAP system, it is analyzed by an integrated additional noise reduction and voice activity detection (VAD) stage, as described in Sect. 7.4. The VAD outputs a voice activity signal, which is buffered and evaluated by a control unit. Upon receiving a PTS event, the control unit locates the speech onset using the buffered voice activity signal from both the past and the present. The control unit also initializes and triggers the ASR engine, which is then supplied with the appropriate portion of the error signal from the lower buffer, depending on the detected SOU.
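The two-branch buffering described above can be sketched as follows. This is a minimal illustration only: the class name `TapBuffer`, the frame-wise Boolean VAD flag, and the backward search with a tolerated pause of `max_gap` frames are assumptions made for the example, not the onset rule used by the TAP system itself.

```python
from collections import deque


class TapBuffer:
    """Hypothetical sketch of the TAP buffers: error-signal frames (lower
    branch) and one voice activity flag per frame (upper branch)."""

    def __init__(self, capacity):
        # deque(maxlen=...) behaves like a circular buffer: once full,
        # appending a new frame silently drops the oldest one.
        self.frames = deque(maxlen=capacity)
        self.vad = deque(maxlen=capacity)

    def push(self, frame, is_speech):
        """Store one error-signal frame together with its VAD decision."""
        self.frames.append(frame)
        self.vad.append(bool(is_speech))

    def locate_onset(self, max_gap=0):
        """Search backward from the newest frame for the start of the most
        recent speech run. Pauses of up to `max_gap` non-speech frames
        inside the run are bridged (an assumed rule, for illustration).
        Returns a buffer index, or None if no speech frame is buffered."""
        onset = None
        gap = 0
        for i in range(len(self.vad) - 1, -1, -1):
            if self.vad[i]:
                onset = i
                gap = 0
            else:
                gap += 1
                if onset is not None and gap > max_gap:
                    break
        return onset

    def on_pts_event(self, max_gap=0):
        """Return the buffered frames from the detected onset onward,
        i.e. the portion that would be handed to the ASR engine."""
        onset = self.locate_onset(max_gap)
        if onset is None:
            return []
        return list(self.frames)[onset:]
```

In this sketch, calling `on_pts_event()` at the moment the button is pressed returns the frames from the estimated onset up to the present, so speech that began before the button press is not lost.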

7.3 Acoustic Echo Cancelation and Postfilter

The AEC stage of our system employs the FDAF as described in [4], which unifies AEC and a postfilter for residual echo and noise suppression in the frequency domain. While most echo cancellers model the impulse response $h(n)$ of the LEM system, or its transfer function, deterministically, the FDAF is based on a statistical model.

As proposed in [4], the impulse response $h(n)$ is modeled as a random process with the expectation $h_0(n)$ and covariance vector $\Phi_{hh}(n)$.

Actual estimation is performed in the frequency domain. Assuming that variations of the LEM path over time are gradual, the LEM system transfer function estimate $\hat{H}_\ell(k)$ is updated recursively according to

$$\hat{H}_{\ell+1}(k) = A \cdot \hat{H}_\ell(k) + \Delta H_\ell(k), \tag{7.4}$$

where $\ell$ is the time frame index, $k$ is the frequency bin index, $A = 0.9995$ is the transmission factor, and $\Delta H_\ell(k)$ is the echo path update as computed according to [4].

Multiplying the estimated LEM transfer function $\hat{H}_\ell(k)$ by the short-time Fourier transform (STFT) $X_\ell(k)$ of the loudspeaker source signal yields the estimated echo component $\hat{D}_\ell(k)$ in the short-time spectral domain. This estimate is then subtracted from the STFT $Y_\ell(k)$ of the microphone signal, resulting in an error signal $E_\ell(k)$. Note that before the STFT is applied, the signals $x(n)$ and $y(n)$ are passed through a high-pass filter with a cutoff frequency $f_c = 200$ Hz to remove low-frequency noise.
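The per-frame arithmetic of this stage can be illustrated with a short sketch. It covers only the multiplication, the subtraction, and the recursion of Eq. (7.4). The echo path update is taken as a given input, because its computation follows [4] and is not reproduced here. The function name `aec_frame` and the representation of each spectrum as a list of complex bins are assumptions made for the example.

```python
A = 0.9995  # transmission factor stated in the text


def aec_frame(H, dH, X, Y, a=A):
    """One frame of the frequency-domain echo canceller (sketch).

    H  : current echo path estimate, one complex value per bin k
    dH : echo path update for this frame (placeholder input; the real
         update is computed as in [4], which is not shown here)
    X  : STFT of the loudspeaker signal for this frame
    Y  : STFT of the microphone signal for this frame
    Returns (E, H_next): the error spectrum and the updated estimate.
    """
    if not (len(H) == len(dH) == len(X) == len(Y)):
        raise ValueError("all spectra must have the same number of bins")
    # Estimated echo component: transfer function times loudspeaker spectrum.
    D = [h * x for h, x in zip(H, X)]
    # Error signal: microphone spectrum minus estimated echo.
    E = [y - d for y, d in zip(Y, D)]
    # Recursive update of the estimate, Eq. (7.4).
    H_next = [a * h + dh for h, dh in zip(H, dH)]
    return E, H_next
```

With a zero update, the recursion simply scales the estimate by the transmission factor in every frame, so an estimate that is never refreshed decays slowly toward zero.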

To reduce the noise component and to suppress the residual echo that is still present in the error signal $E_\ell(k)$, the FDAF includes an additional frequency-domain postfilter. Its application to the error signal yields an improved estimate of the desired speech signal as

$$\tilde{E}_\ell(k) = E_\ell(k) \cdot W_\ell(k), \tag{7.5}$$
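As a small illustration of Eq. (7.5), the postfilter acts as a bin-wise multiplication of the error spectrum by a set of weights. The sketch below assumes the weights are already available as one value per frequency bin. How they are derived is specified in [4] and is not shown, and the function name `apply_postfilter` is hypothetical.

```python
def apply_postfilter(E, W):
    """Multiply each bin of the error spectrum E by its postfilter weight W.

    E : error spectrum of one frame, one complex value per bin k
    W : postfilter weights for the same frame (assumed given)
    Returns the postfiltered spectrum as a new list.
    """
    if len(E) != len(W):
        raise ValueError("E and W must have the same number of bins")
    return [e * w for e, w in zip(E, W)]
```

A weight of 1 leaves a bin unchanged, while a weight close to 0 suppresses it. This is how a postfilter of this form attenuates bins dominated by noise or residual echo.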
