USE OF MULTIPLE SPEECH RECOGNITION UNITS IN AN IN-CAR ASSISTANCE SYSTEM - DSP for In-Vehicle and Mobile Systems

of that work, magnitude spectral subtraction and log-MMSE estimation were

adopted for background noise reduction, together with quantile noise

estimation.
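As a rough illustration of the two ideas named above, the sketch below estimates a per-bin noise magnitude as a quantile over time and then applies magnitude spectral subtraction with a spectral floor. The function names, the quantile q, the over-subtraction factor alpha and the floor value are assumptions made for this example; they are not the parameters determined in [10], and the log-MMSE estimator is not shown.

```python
def quantile_noise_estimate(frames, q=0.5):
    """Per-bin noise magnitude: the q-th quantile over time of each frequency bin.

    frames is a list of magnitude spectra (one list of floats per frame).
    q=0.5 is an illustrative value, not one taken from the text.
    """
    n_bins = len(frames[0])
    noise = []
    for k in range(n_bins):
        values = sorted(frame[k] for frame in frames)
        idx = min(int(q * len(values)), len(values) - 1)
        noise.append(values[idx])
    return noise


def spectral_subtract(frame, noise, alpha=1.0, floor=0.1):
    """Magnitude spectral subtraction with a floor.

    alpha (over-subtraction) and floor are hypothetical defaults; the floor
    keeps each bin at or above a fraction of its noisy magnitude.
    """
    return [max(m - alpha * n, floor * m) for m, n in zip(frame, noise)]


if __name__ == "__main__":
    spectra = [[1.0, 2.0], [1.0, 2.0], [5.0, 2.0]]
    noise = quantile_noise_estimate(spectra)
    print([spectral_subtract(f, noise) for f in spectra])
```

The quantile estimate relies on speech being absent from each frequency bin for a sizeable fraction of the frames, so no explicit speech/pause decision is needed to track the noise level.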

In [10] an optimal set of parameters was determined for the use of spectral

subtraction and log-MMSE on a connected digit recognition task; the same set

is used here. The noise subtraction module is used only for far-microphone

input processing. The front-end processing includes a Voice Activity

Detection (VAD) module. It is based on the energy information in the case of

close-talk input and on a spectral variation function technique applied to the

output of the Mel-based filter bank in the case of the far-microphone signal.

According to preliminary experiments on SpeechDat.Car material [11], both

techniques allow recognition performance equivalent to that determined by

using manually segmented utterances, except for cases of non-stationary noise

events.
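An energy-based detector of the kind described for the close-talk input can be sketched as follows. This is only a minimal illustration: the decision rule (a fixed margin below the loudest frame) and the margin_db value are assumptions, since the text does not give the actual rule, and the spectral variation function used for the far-microphone signal is not shown.

```python
import math


def frame_log_energy(frame):
    """Log-energy of one frame in dB; a small constant avoids log(0)."""
    return 10.0 * math.log10(sum(x * x for x in frame) + 1e-12)


def energy_vad(frames, margin_db=30.0):
    """Mark a frame as speech when its energy is within margin_db of the peak.

    margin_db is a hypothetical threshold chosen for illustration only.
    """
    energies = [frame_log_energy(f) for f in frames]
    threshold = max(energies) - margin_db
    return [e > threshold for e in energies]


if __name__ == "__main__":
    silence = [0.0] * 4
    speech = [1.0] * 4
    print(energy_vad([silence, speech, silence]))
```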

The feature extraction module pre-emphasizes the input signal and blocks it

into frames of 20 ms duration, from which 12 Mel scaled

Cepstral Coefficients (MCCs) and the log-energy are extracted. MCCs are

normalized by subtracting the current MCC means. The log-energy is also

normalized with respect to the current maximum energy value. The resulting

MCCs and the normalized log-energy, together with their first and second

order time derivatives, are arranged into a single observation vector of 39

components.
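The assembly of the 39-component vector (13 static features plus their first and second order derivatives) can be sketched as below. Several details are assumptions made for this example: the means and the energy maximum are computed over the whole utterance instead of being tracked as "current" values, and the derivatives use a simple symmetric difference with repeated edge frames, because the text does not specify the derivative formula. The computation of the MCCs themselves is taken as given.

```python
def subtract_mean(cepstra):
    """Remove the per-coefficient mean (here computed over the whole utterance)."""
    n = len(cepstra)
    dims = len(cepstra[0])
    means = [sum(frame[d] for frame in cepstra) / n for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in cepstra]


def normalize_energy(log_energy):
    """Express log-energy relative to the maximum value (peak maps to 0)."""
    peak = max(log_energy)
    return [e - peak for e in log_energy]


def deltas(features):
    """Time derivative as a symmetric difference; edge frames are repeated."""
    n = len(features)
    out = []
    for t in range(n):
        prev = features[max(t - 1, 0)]
        nxt = features[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def build_observations(cepstra, log_energy):
    """Stack static, first-derivative and second-derivative features per frame."""
    static = [c + [e] for c, e in
              zip(subtract_mean(cepstra), normalize_energy(log_energy))]
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


if __name__ == "__main__":
    cepstra = [[float(t + d) for d in range(12)] for t in range(5)]
    energy = [1.0, 3.0, 2.0, 0.0, 1.0]
    obs = build_observations(cepstra, energy)
    print(len(obs), len(obs[0]))
```

With 12 MCCs and one log-energy value per frame, each stacked vector has 13 x 3 = 39 components, matching the dimension stated above.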

2.2 Recognition Engine

The recognition engine for the Italian language is composed of a set of

standard HMM recognition units. Each of them runs independently and

processes the features provided by the front-end module. The HMM units are

based on a set of 34 phone-like speech units. Each acoustic-phonetic unit is

modeled with left-to-right Continuous Density HMMs with output probability

distributions represented by means of mixtures having 16 Gaussian

components with diagonal covariance matrices. HMM training is

accomplished through the standard Baum-Welch training procedure. Phone

units were trained by using far-microphone (and close-talk) signals available

in the Italian portion of SpeechDat.Car corpus. The training portion of this

corpus consists of about 3000 phonetically rich sentences pronounced by 150

speakers.
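To make the output probability model concrete, the following sketch evaluates the log-likelihood of one observation vector under a mixture of Gaussians with diagonal covariance matrices, as would be done for each HMM state. It is an illustration only: the function names are hypothetical, the parameters would in practice come from Baum-Welch training, and the HMM topology and decoding are not shown.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (var holds variances)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log of the weighted sum of component densities, via log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


if __name__ == "__main__":
    # Toy state model: 16 equally weighted components over 39-dimensional vectors.
    dim, n_mix = 39, 16
    weights = [1.0 / n_mix] * n_mix
    means = [[0.1 * k] * dim for k in range(n_mix)]
    variances = [[1.0] * dim for _ in range(n_mix)]
    print(gmm_log_likelihood([0.0] * dim, weights, means, variances))
```

Diagonal covariances reduce each component to a sum over the 39 dimensions, which keeps both the parameter count and the per-frame cost low compared with full covariance matrices.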

A crucial aspect is the selection of the output to feed the NLU module. For a

given input utterance, the outputs provided by the different active units have

to be compared with each other in a reliable way. Although more sophisticated

approaches are possible [12], in this work the simplest decision policy is
