Digital Signal Processing Reference
where the postfilter is given by the generalized Wiener filter

$$W_\ell(k) = \frac{\Phi_{ss,\ell}(k)}{\Phi_{ss,\ell}(k) + |X_\ell(k)|^2 \, \Phi_{hh,\ell}(k) + \Phi_{nn,\ell}(k)}, \tag{7.6}$$

with $\Phi_{ss,\ell}(k)$, $\Phi_{hh,\ell}(k)$, and $\Phi_{nn,\ell}(k)$ denoting the power spectral density (PSD) of
the desired speech signal $s(n)$, the echo path covariance in the frequency domain,
and the PSD of the background noise $n(n)$, respectively. Since the covariance
$\Phi_{hh,\ell}(k)$ can be taken as an uncertainty measure of the LEM system identification,
the product $|X_\ell(k)|^2 \, \Phi_{hh,\ell}(k)$ represents the PSD of the residual echo. The PSDs
$\Phi_{ss,\ell}(k)$ and $\Phi_{nn,\ell}(k)$ are estimated according to [4]. Finally, the postfilter gain
$W_\ell(k)$ is floored to $W_{\min} = -12.6$ dB.
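The gain rule of Eq. (7.6), including the floor, can be sketched per frequency bin as follows. This is a minimal illustration, not the authors' implementation: the function name `postfilter_gain` and its argument names are hypothetical, the PSD estimates are assumed to be given (their estimation follows [4] and is not shown), and the floor is assumed to be an amplitude gain, so that the dB value is converted with a factor of 20.

```python
W_MIN_DB = -12.6
# Assumption: the floor is an amplitude gain, hence 10^(dB/20), roughly 0.234.
W_MIN = 10.0 ** (W_MIN_DB / 20.0)


def postfilter_gain(phi_ss, phi_hh, phi_nn, x_mag_sq, w_min=W_MIN):
    """Floored generalized Wiener gain of Eq. (7.6) for one frame.

    All arguments are equal-length sequences indexed by frequency bin k:
    phi_ss   -- PSD estimate of the desired speech
    phi_hh   -- echo path covariance in the frequency domain
    phi_nn   -- PSD estimate of the background noise
    x_mag_sq -- squared magnitude |X(k)|^2 of the far-end spectrum
    """
    gains = []
    for s, h, n, x2 in zip(phi_ss, phi_hh, phi_nn, x_mag_sq):
        denom = s + x2 * h + n  # speech + residual echo + noise
        w = s / denom if denom > 0.0 else w_min  # guard against 0/0
        gains.append(max(w, w_min))  # apply the gain floor
    return gains
```

A bin dominated by speech yields a gain close to 1, whereas a bin dominated by residual echo or noise is attenuated only down to the floor, which limits the distortion introduced by the postfilter.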
7.4 Integrated Noise Reduction and Voice Activity Detection
Subsequent to echo cancelation, residual vehicle noise $n(n)$ as well as some remains of
the beep may still be contained in the error signal $e(n)$. In the upper path of the TAP
system, robust detection of the speech onset therefore requires these disturbances to
be distinguished from the desired speech component $s(n)$. This problem is approached
here with an additional combined noise reduction and VAD operating on the
short-time spectrum $E_\ell(k)$ of the error signal.
For the removal of the beep, all frequency bins corresponding to the frequency
range from about 1.83 to 2.45 kHz are set to zero. For each frame $\ell$ and frequency bin $k$,
the estimated clean speech spectrum $S_\ell(k)$ is obtained from the error signal $E_\ell(k)$ by
applying a Wiener filter based on the a priori signal-to-noise ratio (SNR) as described in
[5] and [6]. For the computation of this SNR, the power spectral density of the noise is
estimated by employing a 3-state time- and frequency-dependent VAD [3].
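The two per-frame operations described above, zeroing the beep band and applying a Wiener gain, can be illustrated with a short sketch. Several points are assumptions made for illustration only: the function names, the FFT length and sampling rate used in the usage note, and the common Wiener gain form ξ/(1 + ξ) for an a priori SNR ξ. The a priori SNR itself is taken as given here; its estimation according to [5], [6] and the VAD-driven noise PSD estimate [3] are not reproduced.

```python
def beep_bins(n_fft, fs_hz, f_lo_hz=1830.0, f_hi_hz=2450.0):
    """Indices of one-sided FFT bins whose center frequency lies in the beep band."""
    return [k for k in range(n_fft // 2 + 1)
            if f_lo_hz <= k * fs_hz / n_fft <= f_hi_hz]


def enhance_frame(err_spec, xi, n_fft, fs_hz):
    """Estimate the clean speech spectrum of one frame.

    err_spec -- one-sided error spectrum E(k), a sequence of complex values
    xi       -- a priori SNR per bin (linear scale), assumed to be given
    """
    blocked = set(beep_bins(n_fft, fs_hz))
    speech_est = []
    for k, (e, snr) in enumerate(zip(err_spec, xi)):
        if k in blocked:
            speech_est.append(0j)  # beep band is set to zero
        else:
            speech_est.append(e * snr / (1.0 + snr))  # Wiener gain xi/(1+xi)
    return speech_est
```

With a hypothetical 256-point FFT at a sampling rate of 8 kHz, the bin spacing is 31.25 Hz and `beep_bins(256, 8000.0)` selects bins 59 to 78.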
The output of the VAD is transformed into a per-frame voice activity signal $v(\ell) \in [0, 1]$
by averaging over relevant frequency bins (see [3]) and then stored in the
upper circular buffer as shown in Fig. 7.1. The final decision about the time of the
speech onset is made by the VAD control unit: The hypothesized speech onset
frame $\ell_{SOU}$ is the latest nonspeech frame (i.e., $v(\ell_{SOU}) \approx 0$) before $v(\ell)$ exceeds an
empirical threshold [3].
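The onset decision can be sketched as a backward search over the buffered voice activity signal. The function name, the threshold value, and the tolerance `eps` that decides when a frame counts as nonspeech are all hypothetical; the actual threshold and decision logic are those of [3].

```python
def speech_onset_frame(v, threshold, eps=1e-3):
    """Return the index of the hypothesized speech onset frame, or None.

    v         -- per-frame voice activity values in [0, 1], oldest first
    threshold -- empirical activity threshold (hypothetical value)
    eps       -- tolerance below which a frame counts as nonspeech (assumption)
    """
    for ell, activity in enumerate(v):
        if activity > threshold:
            # Step back to the latest nonspeech frame before the crossing.
            for j in range(ell - 1, -1, -1):
                if v[j] <= eps:
                    return j
            return None  # no nonspeech frame available before the crossing
    return None  # threshold never exceeded


# Usage: activity first exceeds 0.5 at frame 5; the latest nonspeech
# frame before that is frame 3.
onset = speech_onset_frame([0.0, 0.0, 0.2, 0.0, 0.3, 0.6, 0.9], threshold=0.5)
```

Searching backward from the threshold crossing, rather than taking the crossing frame itself, places the onset before the low-energy beginning of the utterance, which is the purpose of buffering the signal.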
7.5 Experimental Setup
For experimental evaluation, we performed an offline batch simulation of the TAP
system using the Cambridge Hidden Markov Model Toolkit (HTK) for ASR. Instead
of a physical LEM system, we used a digital LEM impulse response measured inside
a vehicle. In the next two subsections, the near-end speech files as well as the noise
and echo signals are described.
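Replacing the physical LEM system by a measured impulse response amounts to convolving the loudspeaker signal with that response and adding the result to the near-end signals. The sketch below illustrates this under stated assumptions: the additive microphone model (speech plus noise plus echo), the function names, and the toy signal values are illustrative and not taken from the experimental setup.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    if not x or not h:
        return []
    y = [0.0] * (len(x) + len(h) - 1)
    for i, x_i in enumerate(x):
        for j, h_j in enumerate(h):
            y[i + j] += x_i * h_j
    return y


def simulate_microphone(speech, noise, far_end, lem_ir):
    """Microphone signal as speech + noise + echo (assumed additive model).

    The echo is the far-end (loudspeaker) signal filtered by the LEM
    impulse response, truncated or zero-padded to the speech length.
    """
    echo = convolve(far_end, lem_ir)[:len(speech)]
    echo += [0.0] * (len(speech) - len(echo))
    return [s + n + d for s, n, d in zip(speech, noise, echo)]


# Usage with toy values: a unit impulse on the loudspeaker reproduces
# the impulse response as echo in the microphone signal.
mic = simulate_microphone([1.0, 1.0, 1.0], [0.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0], [0.5, 0.25])
```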