A Likelihood-Maximizing Framework for Enhanced In-Car Speech Recognition Based on Speech Dialog System Interaction - Digital Signal Processing for In-Vehicle Systems and Safety


For each observation, 39-dimensional MFCC feature vectors were generated, consisting of 13 MFCCs (including $C_0$) plus 13 delta and 13 acceleration coefficients. Cepstral mean subtraction was applied to each feature. The elements of the Jacobian were derived from this feature representation as per Eq. 10.7.
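As an illustration of how a 39-dimensional vector of this kind can be assembled, the following minimal sketch stacks 13 static coefficients with their delta and acceleration coefficients and then subtracts the per-utterance mean from every feature. The regression-based delta formula, the window of two frames, the edge-frame replication, and the function names are assumptions made for this example; the chapter does not specify them.

```python
def deltas(frames, window=2):
    """Regression-based delta coefficients (formula and window are assumptions)."""
    n_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n_frames - 1)][d]  # replicate last frame at the edge
                prv = frames[max(t - k, 0)][d]             # replicate first frame at the edge
                num += k * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


def cepstral_mean_subtraction(frames):
    """Subtract the per-utterance mean from every feature dimension."""
    n_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n_frames for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def build_features(static):
    """Stack static, delta, and acceleration coefficients, then apply CMS."""
    d1 = deltas(static)   # delta coefficients
    d2 = deltas(d1)       # acceleration coefficients (deltas of deltas)
    stacked = [s + a + b for s, a, b in zip(static, d1, d2)]
    return cepstral_mean_subtraction(stacked)


# Toy input: 10 frames of 13 made-up "static" coefficients (C0..C12).
static = [[float(t * (d + 1)) for d in range(13)] for t in range(10)]
features = build_features(static)
print(len(features), len(features[0]))
```

Each output frame has 13 + 13 + 13 = 39 entries, and every feature dimension has zero mean over the utterance after the subtraction.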

The recognition task uses an open word loop grammar [12]; therefore, no restrictions are made to ensure that exactly ten digits are recognized.

All speech recognition results quoted in this chapter are word accuracies (in %) and are calculated as

$$
\text{Accuracy} = \frac{N - D - S - I}{N} \times 100, \tag{10.8}
$$

where N represents the total number of words, D the number of deletions, S the number of substitutions, and I the number of insertions [13].
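Eq. 10.8 can be checked with a short helper. This is a minimal sketch; the function name and the example counts are made up for illustration and are not results from the chapter.

```python
def word_accuracy(n_words, deletions, substitutions, insertions):
    """Word accuracy in percent, following Eq. 10.8."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return 100.0 * (n_words - deletions - substitutions - insertions) / n_words


# Made-up counts: 100 reference words, 2 deletions, 3 substitutions, 1 insertion.
print(word_accuracy(100, 2, 3, 1))   # 94.0

# Insertions are penalized, so the value can drop below zero.
print(word_accuracy(10, 0, 0, 12))   # -20.0
```

Because insertions count against the score, an unconstrained word loop grammar that hypothesizes extra words lowers the accuracy even when every reference word is recognized correctly.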

10.3.3 Optimization Iterations

Since LIMA is an optimization problem, the enhancement parameters can easily be over-optimized to a specific noise condition, speaker, or subset of acoustic state models, and this should be avoided. This suggests that the number of optimization iterations should be kept small in order to maintain generality across conditions; with too few iterations, however, the LIMA framework may operate less effectively than a standard enhancement system. Real-time operation, another important consideration for in-car ASR, also favors a limited number of iterations.

To address this issue, two experiments were designed to determine a suitable balance between ASR performance and pseudo real-time operation using the noise-only calibration framework described in Sect. 10.3.4. This framework was chosen on the expectation that, because speaker-independent acoustic models are used, noise conditions affect the resulting enhancement parameters more than individual speakers do.

In the first experiment, the number of gradient-descent iterations was varied whilst using a single joint optimization iteration (i.e., one full recognition and parameter optimization cycle). The second experiment varied the number of joint optimization iterations whilst the number of gradient-descent iterations (determined from the first experiment) was kept constant. The combined outcomes of these experiments dictated the levels of optimization used for assessing the frameworks detailed in Sect. 10.3.4.
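The two iteration counts varied in these experiments can be pictured as a pair of nested loops: an outer joint optimization loop that runs a full recognition pass, and an inner loop of gradient-descent updates on the enhancement parameters. The sketch below only illustrates that control flow and is not the chapter's implementation. The callables `decode` and `gradient`, the step size, and the toy quadratic objective are all hypothetical stand-ins; in the real framework the gradient would come from the recognizer's likelihood via the Jacobian.

```python
def optimize_parameters(decode, gradient, n_filters=26,
                        joint_iters=1, gd_iters=5, step=0.1):
    """Nested-loop sketch: joint iterations wrap gradient-descent iterations.

    decode(params)              -> result of a recognition pass (hypothetical)
    gradient(params, alignment) -> gradient of the objective being minimized,
                                   e.g. a negative log-likelihood (hypothetical)
    """
    params = [1.0] * n_filters                  # one parameter per Mel filterbank, initialized to 1
    for _ in range(joint_iters):
        alignment = decode(params)              # full recognition pass
        for _ in range(gd_iters):
            grad = gradient(params, alignment)  # gradient for the fixed alignment
            params = [p - step * g for p, g in zip(params, grad)]
    return params


# Toy stand-ins (assumptions, for demonstration only): the "recognizer" returns
# a fixed target vector and the objective is 0.5 * sum((p - target)^2).
def toy_decode(params):
    return [2.0] * len(params)


def toy_gradient(params, target):
    return [p - t for p, t in zip(params, target)]


result = optimize_parameters(toy_decode, toy_gradient,
                             joint_iters=1, gd_iters=3, step=0.5)
print(result[0])   # 1.875
```

With this structure, the first experiment corresponds to varying `gd_iters` with `joint_iters=1`, and the second to varying `joint_iters` with `gd_iters` held fixed.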

For all experiments, the enhancement parameters were initialized to $\alpha(k) = 1$ for all 26 Mel-filterbanks. These values were an appropriate initial guess since standard MFNS using these values provides improvements in speech recognition accuracy over a system without enhancement [10].
