Digital Signal Processing Reference
4.3 Robust Speech Recognition
The CU-Move system incorporates a number of advances in robust speech
recognition, including a new, more robust acoustic feature representation and
built-in speaker normalization. Here, we report results from evaluations using
CU-Move Release 1.1 A data from the extended digits part, which is aimed at
phone dialing applications.
Capturing the vocal tract transfer function (VTTF) from the speech signal
while eliminating other extraneous information, such as speaker-dependent
characteristics and pitch harmonics, is a key requirement for robust and
accurate speech recognition [33, 34]. The vocal tract transfer function is
mainly encoded in the short-term spectral envelope [35]. Traditional MFCCs
use the gross spectrum obtained as the output of a non-linearly spaced
filterbank to represent the spectral envelope. While this approach works well
for unvoiced sounds, it produces a substantial mismatch for voiced and mixed
sounds [34]. For voiced speech, the estimated formant frequencies are biased
towards strong harmonics and their bandwidths are misestimated [34, 35]. MFCCs
are also known to be fragile in noisy conditions, requiring additional
compensation for acceptable performance in realistic environments [45, 28].
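The filterbank step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the CU-Move front end: the mel mapping 2595 log10(1 + f/700) is one common convention, and the function names and the settings (24 filters, 512-point FFT, 8 kHz sampling) are hypothetical choices made for the example.

```python
import numpy as np

def hz_to_mel(f_hz):
    # One common mel-scale convention (an assumption, not from the text).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(num_filters, fft_size, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    num_bins = fft_size // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, num_bins)
    # num_filters + 2 edges: filter i spans edges i, i + 1 and i + 2.
    edges_hz = mel_to_hz(
        np.linspace(0.0, hz_to_mel(sample_rate / 2.0), num_filters + 2)
    )
    bank = np.zeros((num_filters, num_bins))
    for i in range(num_filters):
        low, centre, high = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_freqs - low) / (centre - low)
        falling = (high - bin_freqs) / (high - centre)
        # Triangle = lower of the two slopes, clipped at zero outside the band.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def filterbank_energies(frame, bank):
    """Gross spectrum: weighted sums of the frame's power spectrum."""
    fft_size = 2 * (bank.shape[1] - 1)
    windowed = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=fft_size)) ** 2
    return bank @ power

# Hypothetical usage: one 25 ms frame of a 1 kHz tone sampled at 8 kHz.
bank = mel_filterbank(num_filters=24, fft_size=512, sample_rate=8000)
t = np.arange(200) / 8000.0
energies = filterbank_energies(np.cos(2 * np.pi * 1000.0 * t), bank)
```

Each filter output is a weighted average of the power spectrum over its band. For voiced speech, that average mixes harmonic peaks with the valleys between them, so it does not follow the spectral envelope at the harmonics themselves.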
The Minimum Variance Distortionless Response (MVDR) spectrum has a long
history in signal processing but has only recently been applied successfully
to speech modeling [36]. It has many desirable characteristics for a spectral
envelope estimation method, the most important being that it estimates the
spectral powers accurately at the perceptually important harmonics, thereby
providing an upper envelope, which has strong implications for robustness in
additive noise. Since the upper envelope relies on the high-energy portions of
the spectrum, it is not affected substantially by additive noise. Therefore,
using MVDR for spectral envelope estimation in robust speech recognition is
feasible and useful [37].
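As a rough illustration of the MVDR spectrum, the sketch below uses the textbook matrix form P(w) = 1 / (v(w)^H R^-1 v(w)), where R is the (M+1) x (M+1) Toeplitz autocorrelation matrix and v(w) = [1, e^{jw}, ..., e^{jMw}]^T. It is not the implementation used in CU-Move. The function names, the biased autocorrelation estimate, the small diagonal loading term, and the model order M = 12 are all assumptions made for the example, and the direct matrix solve is chosen for clarity rather than efficiency.

```python
import numpy as np

def autocorrelation(frame, max_lag):
    """Biased autocorrelation estimate r(0..max_lag) of a real frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array(
        [np.dot(frame[: n - k], frame[k:]) / n for k in range(max_lag + 1)]
    )

def mvdr_spectrum(autocorr, num_freqs=256, diagonal_load=1e-9):
    """MVDR power spectrum on num_freqs points in [0, pi].

    autocorr holds r(0..M); the model order M is len(autocorr) - 1.
    diagonal_load is an assumed regularisation term, not from the text.
    """
    autocorr = np.asarray(autocorr, dtype=float)
    order = len(autocorr) - 1
    idx = np.arange(order + 1)
    # Symmetric Toeplitz matrix: R[i, j] = r(|i - j|).
    R = autocorr[np.abs(np.subtract.outer(idx, idx))]
    R = R + diagonal_load * autocorr[0] * np.eye(order + 1)
    omegas = np.linspace(0.0, np.pi, num_freqs)
    # Column f of V is the steering vector v(omegas[f]).
    V = np.exp(1j * np.outer(idx, omegas))
    R_inv_V = np.linalg.solve(R, V)
    # v^H R^-1 v for every frequency at once; real because R is symmetric.
    denom = np.real(np.sum(np.conj(V) * R_inv_V, axis=0))
    return omegas, 1.0 / denom

# Hypothetical usage: a tone at pi/4 rad/sample in a small amount of white
# noise, described here directly by its theoretical autocorrelation.
lags = np.arange(13)                      # model order M = 12 (assumed)
r = 0.5 * np.cos(np.pi / 4 * lags)
r[0] += 0.01                              # white-noise variance at lag 0
omegas, power = mvdr_spectrum(r, num_freqs=257)
peak_frequency = omegas[np.argmax(power)]
```

For white noise alone (R proportional to the identity), the spectrum is flat at r(0) / (M + 1). A strong sinusoidal component raises the estimate near its own frequency, which is the behaviour the upper-envelope argument relies on.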
4.3.1 MVDR Spectral Envelope Estimation
For details of MVDR spectrum estimation and its previous uses for speech
parameterization, we refer the reader to [36-40]. In MVDR spectrum
estimation, the signal power at a given frequency is determined by
filtering the signal with a specially designed FIR filter, h(n), and measuring the
power at its output. The FIR filter, h(n), is designed to minimize its output