Digital Signal Processing Reference
of that work, magnitude spectral subtraction and log-MMSE estimation were
adopted for background noise reduction, together with quantile noise
estimation.
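
As a rough illustration of how magnitude spectral subtraction can be combined with quantile noise estimation, the sketch below takes the per-bin quantile of the magnitude spectrum over time as the noise estimate and subtracts it from every frame. The function names, the quantile `q`, the oversubtraction factor `alpha`, and the spectral floor `beta` are assumptions chosen for illustration; they are not the parameter set determined in [10].

```python
def quantile_noise_estimate(mag_frames, q=0.5):
    """Per-bin noise magnitude: the q-quantile over time of each frequency bin.

    mag_frames is a list of frames, each a list of magnitude values.
    q=0.5 is an illustrative default, not a value taken from the source.
    """
    n_frames = len(mag_frames)
    n_bins = len(mag_frames[0])
    idx = min(n_frames - 1, int(q * (n_frames - 1) + 0.5))
    noise = []
    for k in range(n_bins):
        values = sorted(frame[k] for frame in mag_frames)
        noise.append(values[idx])
    return noise


def spectral_subtract(mag_frames, noise, alpha=1.0, beta=0.01):
    """Subtract alpha * noise from each magnitude, flooring at beta * magnitude.

    alpha (oversubtraction) and beta (spectral floor) are hypothetical values.
    """
    return [
        [max(m - alpha * n, beta * m) for m, n in zip(frame, noise)]
        for frame in mag_frames
    ]


# Toy magnitude spectrogram: 5 frames, 2 bins, stationary noise of level 1.0.
frames = [[1.0, 1.0], [1.0, 1.0], [5.0, 1.0], [1.0, 1.0], [1.0, 3.0]]
noise = quantile_noise_estimate(frames)
cleaned = spectral_subtract(frames, noise)
```

The quantile is taken over time because, in each frequency bin, speech energy is present only in a fraction of the frames, so a low or middle quantile tends to track the noise level without requiring an explicit speech/non-speech decision.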
In [10] an optimal set of parameters was determined for the use of spectral
subtraction and log-MMSE on a connected digit recognition task; the same set
is used here. The noise subtraction module is used only for far-microphone
input processing. The front-end processing includes a Voice Activity
Detection (VAD) module. It is based on the energy information in the case of
close-talk input and on a spectral variation function technique applied to the
output of the Mel-based filter bank in the case of the far-microphone signal.
According to preliminary experiments on SpeechDat.Car material [11], both
techniques allow recognition performance equivalent to that obtained with
manually segmented utterances, except in cases of non-stationary noise
events.
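
The energy-based variant of the VAD can be sketched as a simple threshold on frame log-energies. This is only a minimal illustration: the non-overlapping framing, the `ratio` parameter, and the rule of placing the threshold between the lowest and highest frame energies are assumptions, not the detector actually used in the system, and the spectral variation function used for the far-microphone signal is not shown.

```python
import math


def frame_log_energies(samples, frame_len):
    """Natural-log energy of consecutive non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Small constant avoids log(0) on digitally silent frames.
        energies.append(math.log(sum(x * x for x in frame) + 1e-10))
    return energies


def energy_vad(samples, frame_len, ratio=0.5):
    """Flag a frame as speech when its log-energy exceeds a threshold.

    The threshold sits a fraction `ratio` of the way between the minimum
    and maximum frame log-energy (a hypothetical rule for illustration).
    """
    energies = frame_log_energies(samples, frame_len)
    if not energies:
        return []
    lo, hi = min(energies), max(energies)
    threshold = lo + ratio * (hi - lo)
    return [e > threshold for e in energies]


# Toy signal: silence, a burst of constant amplitude, silence.
signal = [0.0] * 160 + [0.5] * 160 + [0.0] * 160
flags = energy_vad(signal, frame_len=160)
```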
The feature extraction module pre-emphasizes the input signal and blocks it
into frames of 20 ms duration, from which 12 Mel-scaled Cepstral
Coefficients (MCCs) and the log-energy are extracted. MCCs are
normalized by subtracting the current MCC means. The log-energy is also
normalized with respect to the current maximum energy value. The resulting
MCCs and the normalized log-energy, together with their first- and
second-order time derivatives, are arranged into a single observation vector
of 39 components (13 static features plus 13 first-order and 13 second-order
derivatives).
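
The assembly of the 39-component observation vector can be sketched as follows. The sketch assumes that the 12 MCCs and the log-energy of each frame are already available, computes the mean and the maximum over the whole utterance rather than as running ("current") estimates, and approximates the time derivatives by a central difference; all three choices, like the function names, are simplifying assumptions and not details given in the text.

```python
def mean_normalize(cepstra):
    """Subtract the per-coefficient mean computed over all frames."""
    n = len(cepstra)
    dims = len(cepstra[0])
    means = [sum(frame[d] for frame in cepstra) / n for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in cepstra]


def max_normalize(log_energy):
    """Subtract the maximum log-energy, so the loudest frame maps to 0."""
    peak = max(log_energy)
    return [e - peak for e in log_energy]


def deltas(frames):
    """Time derivative by central difference, replicating the edge frames."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def observation_vectors(cepstra, log_energy):
    """Stack 13 static features with their first and second derivatives."""
    static = [
        c + [e]
        for c, e in zip(mean_normalize(cepstra), max_normalize(log_energy))
    ]
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 5 frames of 12 coefficients plus one log-energy per frame.
cepstra = [[float(t + d) for d in range(12)] for t in range(5)]
log_energy = [1.0, 2.0, 5.0, 2.0, 1.0]
obs = observation_vectors(cepstra, log_energy)
```

Each row of `obs` holds the static features at indices 0 to 12, the first-order derivatives at 13 to 25, and the second-order derivatives at 26 to 38.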
2.2 Recognition Engine
The recognition engine for the Italian language is composed of a set of
standard HMM recognition units. Each of them runs independently and
processes the features provided by the front-end module. The HMM units are
based on a set of 34 phone-like speech units. Each acoustic-phonetic unit is
modeled with left-to-right Continuous Density HMMs with output probability
distributions represented by means of mixtures having 16 Gaussian
components with diagonal covariance matrices. HMM training is
accomplished through the standard Baum-Welch training procedure. Phone
units were trained using far-microphone (and close-talk) signals available
in the Italian portion of the SpeechDat.Car corpus. The training portion of
this corpus consists of about 3000 phonetically rich sentences pronounced by
150 speakers.
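
For reference, the output probability of one HMM state under a mixture of Gaussians with diagonal covariance matrices can be evaluated as in the sketch below. It computes log sum_k w_k N(x; mu_k, diag(var_k)) with the log-sum-exp trick for numerical stability. The function names and the toy parameters are hypothetical; in the system described here the weights, means, and variances would come from Baum-Welch training.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance at point x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, weights, means, variances):
    """Log-likelihood of x under a diagonal-covariance Gaussian mixture."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp: factor out the largest term before exponentiating.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Toy state: 16 identical unit-variance components over 39 dimensions.
n_mix, dim = 16, 39
weights = [1.0 / n_mix] * n_mix
means = [[0.0] * dim for _ in range(n_mix)]
variances = [[1.0] * dim for _ in range(n_mix)]
score = log_gmm([0.0] * dim, weights, means, variances)
```

With diagonal covariances the density factorizes over the 39 components, so each mixture component costs a single pass over the observation vector instead of a full matrix operation.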
A crucial aspect is the selection of the output to feed to the NLU module.
For a given input utterance, the outputs provided by the different active
units have to be compared with each other in a reliable way. Although more
sophisticated
approaches are possible [12], in this work the simplest decision policy is