A Novel Way to Start Speech Dialogs in Cars by Talk-and-Push (TAP) - Digital Signal Processing for In-Vehicle Systems and Safety

where the postfilter is given by the generalized Wiener filter

$$W_\ell(k) = \frac{\Phi_{ss,\ell}(k)}{\Phi_{ss,\ell}(k) + |X_\ell(k)|^2\,\Phi_{hh,\ell}(k) + \Phi_{nn,\ell}(k)}, \tag{7.6}$$

with $\Phi_{ss,\ell}(k)$, $\Phi_{hh,\ell}(k)$, and $\Phi_{nn,\ell}(k)$ denoting the power spectral density (PSD) of the desired speech signal $s(n)$, the echo path covariance in the frequency domain, and the PSD of the background noise $n(n)$, respectively. Since the covariance $\Phi_{hh,\ell}(k)$ can be taken as an uncertainty measure of the LEM system identification, the product $|X_\ell(k)|^2\,\Phi_{hh,\ell}(k)$ represents the PSD of the residual echo. The PSDs $\Phi_{ss,\ell}(k)$ and $\Phi_{nn,\ell}(k)$ are estimated according to [4]. Finally, the postfilter gain $W_\ell(k)$ is floored to $W_{\min} = -12.6$ dB.
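The gain rule of Eq. (7.6) with its floor can be sketched for a single frame and frequency bin as follows. This is only an illustrative sketch: the function and parameter names are hypothetical, the floor is read as -12.6 dB, and the conversion to a linear factor assumes an amplitude gain (20 log10 convention), which the text does not state explicitly.

```python
def db_to_amplitude(gain_db):
    """Convert a gain in dB to a linear amplitude factor (20*log10 convention, assumed)."""
    return 10.0 ** (gain_db / 20.0)


def postfilter_gain(phi_ss, phi_hh, phi_nn, x_power, w_min_db=-12.6):
    """Floored generalized Wiener gain for one frame and one frequency bin.

    phi_ss  : PSD estimate of the desired speech
    phi_hh  : echo path covariance (uncertainty of the LEM identification)
    phi_nn  : PSD estimate of the background noise
    x_power : squared magnitude |X|^2 of the loudspeaker spectrum
    """
    # Residual echo PSD: loudspeaker power weighted by the echo path uncertainty.
    residual_echo_psd = x_power * phi_hh
    denom = phi_ss + residual_echo_psd + phi_nn
    # Guard against an all-zero denominator (silent bin).
    gain = phi_ss / denom if denom > 0.0 else 0.0
    # Apply the spectral floor so the gain never drops below W_min.
    return max(gain, db_to_amplitude(w_min_db))
```

With equal speech, residual echo, and noise power the gain is one third; when no speech is present the gain falls back to the floor.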

7.4 Integrated Noise Reduction and Voice Activity Detection

Subsequent to echo cancelation, residual vehicle noise $n(n)$ as well as some remains of the beep may still be contained in the error signal $e(n)$. In the upper path of the TAP system, robust detection of the speech onset therefore requires these disturbances to be distinguished from the desired speech component $s(n)$. This problem is approached here with a combined additional noise reduction and VAD operating on the short-time spectrum $E_\ell(k)$ of the error signal.

For the removal of the beep, all frequency bins corresponding to the frequency range from about 1.83 to 2.45 kHz are set to zero. For each frame $\ell$ and frequency bin $k$, the estimated clean speech spectrum $\hat{S}_\ell(k)$ is obtained from the error signal $E_\ell(k)$ by applying a Wiener filter based on the a priori signal-to-noise ratio (SNR) as described in [5] and [6]. For the computation of this SNR, the power spectral density of the noise is estimated by employing a 3-state time- and frequency-dependent VAD [3].
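The two per-frame operations described above, zeroing the beep band and applying a Wiener gain, can be sketched as follows. The FFT size and sampling rate are not given in this section and are hypothetical parameters here. The a priori SNR is taken as an input, since its estimation follows [5], [6], and [3] and is not reproduced. The gain formula xi / (1 + xi) is the standard Wiener gain expressed in terms of the a priori SNR xi.

```python
def beep_bins(fft_size, sample_rate_hz, f_lo_hz=1830.0, f_hi_hz=2450.0):
    """Indices of one-sided FFT bins whose centre frequency lies in [f_lo_hz, f_hi_hz]."""
    bin_width = sample_rate_hz / fft_size
    return [k for k in range(fft_size // 2 + 1)
            if f_lo_hz <= k * bin_width <= f_hi_hz]


def wiener_gain(a_priori_snr):
    """Wiener gain as a function of the (linear, non-negative) a priori SNR."""
    return a_priori_snr / (1.0 + a_priori_snr)


def enhance_frame(error_spectrum, a_priori_snr, fft_size, sample_rate_hz):
    """Zero the beep band, then weight the remaining bins with the Wiener gain.

    error_spectrum : complex bins E(k) of one frame (one-sided spectrum)
    a_priori_snr   : per-bin a priori SNR estimates for the same frame
    """
    blocked = set(beep_bins(fft_size, sample_rate_hz))
    clean = []
    for k, (e_k, xi_k) in enumerate(zip(error_spectrum, a_priori_snr)):
        if k in blocked:
            clean.append(0j)                       # bin inside the beep band
        else:
            clean.append(wiener_gain(xi_k) * e_k)  # attenuate noisy bins
    return clean
```

For example, with a hypothetical 256-point FFT at 8 kHz the bin spacing is 31.25 Hz, so bins 59 to 78 fall inside the beep band and are set to zero.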

The output of the VAD is transformed into a per-frame voice activity signal $v(\ell) \in [0, 1]$ by averaging over relevant frequency bins (see [3]) and then stored in the upper circular buffer as shown in Fig. 7.1. The final decision about the time of the speech onset is made by the VAD control unit: the hypothesized speech onset frame $\ell_{SOU}$ is the latest nonspeech frame (i.e., $v(\ell_{SOU}) \approx 0$) before $v(\ell)$ exceeds an empirical threshold [3].
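The onset decision described above can be sketched as a scan over the buffered voice activity values. This is a minimal illustration, not the control unit of [3]: the threshold of 0.5, the tolerance used to decide that a frame is nonspeech, the buffer length, and all names are hypothetical.

```python
from collections import deque


def find_speech_onset(activity, threshold=0.5, nonspeech_level=0.05):
    """Return the latest nonspeech frame before v first exceeds the threshold.

    activity : iterable of (frame_index, v) pairs, oldest first, with v in [0, 1]
    Returns None if v never exceeds the threshold or no nonspeech frame precedes it.
    """
    last_nonspeech = None
    for frame, v in activity:
        if v > threshold:
            return last_nonspeech      # speech detected: report the onset frame
        if v <= nonspeech_level:
            last_nonspeech = frame     # remember the most recent nonspeech frame
    return None


# Circular buffer holding the most recent frames (the length is an assumption).
buffer = deque(maxlen=8)
for frame, v in enumerate([0.0, 0.0, 0.02, 0.0, 0.2, 0.4, 0.7, 0.9]):
    buffer.append((frame, v))

onset_frame = find_speech_onset(buffer)
```

In this example the activity first exceeds the threshold at frame 6, and the latest preceding nonspeech frame is frame 3, so `onset_frame` is 3. Frames 4 and 5 already carry some activity but are neither nonspeech nor above the threshold.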

7.5 Experimental Setup

For experimental evaluation, we performed an offline batch simulation of the TAP system using the Cambridge Hidden Markov Model Toolkit (HTK) for ASR. Instead of a physical LEM system, we used a digital LEM impulse response measured inside a vehicle. In the next two subsections, the near-end speech files as well as the noise and echo signals are described.
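Replacing the physical LEM system by a measured impulse response amounts to convolving the loudspeaker signal with that response. The sketch below illustrates this under the common assumption of an additive microphone model (speech plus noise plus echo); the function names and the additive mixing are assumptions for illustration and are not taken from the experimental setup itself.

```python
def simulate_echo(loudspeaker, impulse_response):
    """Linear convolution of the loudspeaker signal with an LEM impulse response.

    The result is truncated to the length of the loudspeaker signal.
    """
    echo = []
    for n in range(len(loudspeaker)):
        acc = 0.0
        for i, h_i in enumerate(impulse_response):
            if i > n:
                break                  # no loudspeaker samples before time 0
            acc += h_i * loudspeaker[n - i]
        echo.append(acc)
    return echo


def microphone_signal(speech, noise, echo):
    """Additive microphone model (assumed): near-end speech + noise + echo."""
    return [s + b + d for s, b, d in zip(speech, noise, echo)]
```

Feeding a unit impulse through `simulate_echo` returns the impulse response itself, which is a quick sanity check for the simulated echo path.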
