where the postfilter is given by the generalized Wiener filter

$$W_\ell(k) = \frac{\Phi_{ss,\ell}(k)}{\Phi_{ss,\ell}(k) + |X_\ell(k)|^2\,\Phi_{hh,\ell}(k) + \Phi_{nn,\ell}(k)}, \tag{7.6}$$

with $\Phi_{ss,\ell}(k)$, $\Phi_{hh,\ell}(k)$, and $\Phi_{nn,\ell}(k)$ denoting the power spectral
density (PSD) of the desired speech signal $s(n)$, the echo path covariance in the
frequency domain, and the PSD of the background noise $n(n)$, respectively. Since the
covariance $\Phi_{hh,\ell}(k)$ can be taken as an uncertainty measure of the LEM system
identification, the product $|X_\ell(k)|^2\,\Phi_{hh,\ell}(k)$ represents the PSD of the
residual echo. The PSDs $\Phi_{ss,\ell}(k)$ and $\Phi_{nn,\ell}(k)$ are estimated according
to [4]. Finally, the postfilter gain $W_\ell(k)$ is floored to $W_{\min} = -12.6$ dB.
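
The generalized Wiener postfilter with a gain floor can be sketched per frame as
follows. This is a minimal illustration, not the authors' implementation: the
function and argument names are hypothetical, the three PSD estimates are taken as
given inputs (their estimation follows [4] and is not reproduced), and the floor is
assumed to be an attenuation limit of -12.6 dB converted with the 20 log10
amplitude convention.

```python
def postfilter_gain(phi_ss, phi_hh, phi_nn, x_mag_sq, floor_db=-12.6):
    """Generalized Wiener postfilter gain for one frame, one value per bin k.

    phi_ss   : PSD estimate of the desired speech
    phi_hh   : echo path covariance in the frequency domain
    phi_nn   : PSD estimate of the background noise
    x_mag_sq : squared magnitude |X(k)|^2 of the reference spectrum
    floor_db : gain floor in dB (assumption: amplitude gain, 20*log10 convention)
    """
    w_min = 10.0 ** (floor_db / 20.0)
    gains = []
    for s, h, n, x2 in zip(phi_ss, phi_hh, phi_nn, x_mag_sq):
        residual_echo = x2 * h              # PSD of the residual echo
        denom = s + residual_echo + n
        w = s / denom if denom > 0.0 else 0.0
        gains.append(max(w, w_min))         # apply the gain floor
    return gains


# Two bins: one with speech present, one with no speech at all.
gains = postfilter_gain([1.0, 0.0], [0.5, 0.5], [1.0, 1.0], [2.0, 2.0])
```

In the first bin the speech, residual echo, and noise PSDs are all 1, so the gain
is 1/3. In the second bin the unfloored gain would be 0, so the floor takes over.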
7.4 Integrated Noise Reduction and Voice Activity Detection
Subsequent to echo cancelation, residual vehicle noise $n(n)$ as well as some remains
of the beep may still be contained in the error signal $e(n)$. In the upper path of the
TAP system, robust detection of the speech onset therefore requires these disturbances
to be distinguished from the desired speech component $s(n)$. This problem is here
approached with a combined additional noise reduction and VAD operating on the
short-time spectrum $E_\ell(k)$ of the error signal.
For the removal of the beep, all frequency bins corresponding to the frequency
range from about 1.83 to 2.45 kHz are set to zero. For each frame $\ell$ and frequency
bin $k$, the estimated clean speech spectrum $S_\ell(k)$ is obtained from the error signal
$E_\ell(k)$ by applying a Wiener filter based on the a priori signal-to-noise ratio (SNR)
as described in [5] and [6]. For the computation of this SNR, the power spectral
density of the noise is estimated by employing a 3-state time- and
frequency-dependent VAD [3].
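
The two per-frame operations described above, notching out the beep band and
applying a Wiener gain driven by the a priori SNR, can be sketched as below. The
sampling rate (16 kHz) and FFT size (256) in the usage lines are assumptions made
only for illustration, as are all function names. The a priori SNR is taken as a
given input, since its estimation follows [5] and [6] and is not reproduced here.
The gain xi / (1 + xi) is the standard Wiener gain expressed in terms of the a
priori SNR xi.

```python
def beep_bins(fft_size, sample_rate_hz, f_lo_hz=1830.0, f_hi_hz=2450.0):
    """Indices of one-sided FFT bins whose center frequency lies in [f_lo, f_hi]."""
    bin_width_hz = sample_rate_hz / fft_size
    return [k for k in range(fft_size // 2 + 1)
            if f_lo_hz <= k * bin_width_hz <= f_hi_hz]


def enhance_frame(error_spectrum, a_priori_snr, notch_bins):
    """Zero the beep band and apply the Wiener gain xi / (1 + xi) elsewhere.

    error_spectrum : complex values E(k) for one frame
    a_priori_snr   : non-negative a priori SNR per bin (linear scale, not dB)
    notch_bins     : bin indices to be set to zero
    """
    notch = set(notch_bins)
    enhanced = []
    for k, (e, xi) in enumerate(zip(error_spectrum, a_priori_snr)):
        if k in notch:
            enhanced.append(0j)
        else:
            enhanced.append(e * xi / (1.0 + xi))
    return enhanced


# Assumed setup: 16 kHz sampling, 256-point FFT (62.5 Hz bin spacing).
notch = beep_bins(256, 16000.0)
frame = enhance_frame([1 + 1j, 2 + 0j, 4 + 0j], [1.0, 3.0, 1.0], [2])
```

Under these assumed parameters the notch covers bins 30 to 39, i.e., center
frequencies from 1875 Hz to 2437.5 Hz.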
The output of the VAD is transformed into a per-frame voice activity signal
$v(\ell) \in [0, 1]$ by averaging over relevant frequency bins (see [3]) and then stored
in the upper circular buffer as shown in Fig. 7.1. The final decision about the time
of the speech onset is made by the VAD control unit: The hypothesized speech onset
frame $\ell_{\mathrm{SOU}}$ is the latest nonspeech frame (i.e., $v(\ell_{\mathrm{SOU}}) \approx 0$) before
$v(\ell)$ exceeds an empirical threshold [3].
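
The onset decision described above can be illustrated with a short sketch. It is a
simplification under stated assumptions: per-bin VAD outputs are reduced to 0/1
speech decisions (the actual VAD in [3] has three states), the circular buffer is
replaced by a plain list, and a frame counts as nonspeech when its activity is at
or below a hypothetical tolerance `nonspeech_eps`. The threshold value in the
usage lines is likewise made up for illustration.

```python
def frame_activity(bin_decisions, relevant_bins):
    """Average per-bin speech decisions (0 or 1) over the relevant bins."""
    return sum(bin_decisions[k] for k in relevant_bins) / len(relevant_bins)


def speech_onset_frame(v, threshold, nonspeech_eps=0.0):
    """Latest nonspeech frame before v first exceeds the threshold.

    v             : per-frame voice activity values in [0, 1]
    threshold     : empirical activity threshold (assumed value)
    nonspeech_eps : activity level at or below which a frame is nonspeech
    Returns the frame index, or None if no such frame exists.
    """
    for frame, value in enumerate(v):
        if value > threshold:
            # Walk backwards to the most recent nonspeech frame.
            for earlier in range(frame - 1, -1, -1):
                if v[earlier] <= nonspeech_eps:
                    return earlier
            return None
    return None


activity = [0.0, 0.1, 0.0, 0.2, 0.6, 0.9]
onset = speech_onset_frame(activity, threshold=0.5)
```

Here the activity first exceeds the threshold at frame 4, and walking backwards
the latest frame with zero activity is frame 2, which becomes the hypothesized
onset frame.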
7.5 Experimental Setup
For experimental evaluation, we performed an offline batch simulation of the TAP
system using the Cambridge Hidden Markov Model Toolkit (HTK) for ASR. Instead
of a physical LEM system, we used a digital LEM impulse response measured inside
a vehicle. In the next two subsections, the near-end speech files as well as the noise
and echo signals are described.