Digital Signal Processing Reference
For each observation, 39-dimensional MFCC feature vectors were generated, each
consisting of 13 MFCCs (including C0) plus 13 delta and 13 acceleration coefficients.
Cepstral mean subtraction was applied to each feature. The elements of the Jacobian
were derived from this feature representation as per Eq. 10.7.
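The layout of the 39-dimensional vector can be illustrated with a minimal pure-Python sketch. It assumes the 13 static MFCCs per frame are already available, and it uses a simple central difference with replicated edge frames for the delta and acceleration terms; the chapter does not state which delta formula was used, so that choice and the function names `deltas`, `mean_subtract`, and `stack_features` are illustrative assumptions only.

```python
def deltas(frames):
    """Central-difference deltas; edge frames are replicated (assumed scheme)."""
    last = len(frames) - 1
    return [
        [(nxt - prv) / 2.0
         for prv, nxt in zip(frames[max(t - 1, 0)], frames[min(t + 1, last)])]
        for t in range(len(frames))
    ]


def mean_subtract(frames):
    """Subtract the per-dimension mean over the observation from every frame."""
    n = len(frames)
    means = [sum(column) / n for column in zip(*frames)]
    return [[x - m for x, m in zip(frame, means)] for frame in frames]


def stack_features(static):
    """Return [static, delta, acceleration] vectors with mean subtraction applied."""
    d1 = deltas(static)   # 13 delta coefficients per frame
    d2 = deltas(d1)       # 13 acceleration coefficients per frame
    stacked = [s + a + b for s, a, b in zip(static, d1, d2)]
    return mean_subtract(stacked)


# Toy input: 5 frames of 13 made-up "MFCC" values (not real speech features).
static = [[float(t + c) for c in range(13)] for t in range(5)]
features = stack_features(static)
print(len(features), len(features[0]))
```

With 13 static coefficients per frame, each output vector has 13 + 13 + 13 = 39 elements, and every dimension has zero mean over the observation.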
The recognition task uses an open word loop grammar [12]; therefore, no
restrictions are made to ensure that exactly ten digits are recognized.
All speech recognition results quoted in this chapter are word accuracies (in %)
and are calculated as
\[
\text{Accuracy} = \frac{N - D - S - I}{N} \times 100, \tag{10.8}
\]
where N represents the total number of words, D the number of deletions, S the
number of substitutions, and I the number of insertions [13].
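As a quick check of Eq. 10.8, the calculation can be sketched in a few lines of Python. The function name `word_accuracy` and the counts in the example are hypothetical and are not results from this chapter.

```python
def word_accuracy(n_words, deletions, substitutions, insertions):
    """Word accuracy in percent, following Eq. 10.8."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return 100.0 * (n_words - deletions - substitutions - insertions) / n_words


# Made-up counts for illustration: 1000 reference words, 20 deletions,
# 50 substitutions, and 30 insertions.
print(word_accuracy(1000, 20, 50, 30))  # 90.0
```

Because insertions are subtracted in the numerator, the accuracy can become negative when the recognizer inserts many spurious words, which is relevant here since the open word loop grammar does not constrain the number of recognized digits.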
10.3.3 Optimization Iterations
Since LIMA is an optimization problem, the enhancement parameters can easily be
over-optimized to a specific noise condition, speaker, or subset of acoustic state
models, and this should be avoided. To maintain generality across conditions, the
number of optimization iterations should therefore not be large; with too few
iterations, however, the LIMA framework may operate less effectively than a
standard enhancement system. Real-time operation (another important consideration
for in-car ASR) also points to a limited number of iterations.
To address this issue, two experiments were designed to determine a suitable
balance between ASR performance and pseudo real-time operation using the noise-only
calibration framework described in Sect. 10.3.4. This framework was chosen
because, with speaker-independent acoustic models in use, noise conditions were
believed to have a greater effect on the resulting enhancement parameters than
individual speakers.
In the first experiment, the number of gradient-descent iterations was varied
whilst using a single joint optimization iteration (i.e., one full recognition and
parameter optimization cycle). The second experiment varied the number of joint
optimization iterations whilst the number of gradient-descent iterations (determined
from the first experiment) was kept constant. The combined outcomes of these
experiments dictated the levels of optimization used for assessing the frameworks
detailed in Sect. 10.3.4.
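The two iteration counts varied in these experiments can be pictured as a nested loop: an outer loop of joint optimization iterations, each running one recognition pass, and an inner loop of gradient-descent updates. The sketch below shows only that structure and is not the LIMA implementation. The callables `recognize` and `gradient`, the step size, and the toy quadratic cost are all hypothetical stand-ins.

```python
def joint_optimize(params, recognize, gradient, n_joint, n_gd, step):
    """Run n_joint recognition passes, each followed by n_gd gradient-descent updates."""
    for _ in range(n_joint):
        # One recognition pass per joint iteration (hypothetical callable).
        alignment = recognize(params)
        for _ in range(n_gd):
            # Gradient of the cost with respect to each parameter (hypothetical callable).
            grad = gradient(params, alignment)
            params = [p - step * g for p, g in zip(params, grad)]
    return params


# Toy stand-ins: the "recognizer" returns a fixed target, and the cost is
# the squared distance to that target. Neither models a real ASR system.
def toy_recognize(params):
    return [0.5] * len(params)


def toy_gradient(params, target):
    return [2.0 * (p - t) for p, t in zip(params, target)]


initial = [1.0] * 26  # one parameter per Mel filterbank, initialized to 1
tuned = joint_optimize(initial, toy_recognize, toy_gradient,
                       n_joint=1, n_gd=2, step=0.25)
print(tuned[0])  # 0.625
```

In this form, the first experiment corresponds to sweeping `n_gd` with `n_joint=1`, and the second to sweeping `n_joint` with `n_gd` held at the value chosen from the first.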
For all experiments, the enhancement parameters were initialized to 1 for
all 26 Mel-filterbanks. These values were an appropriate initial guess since standard
MFNS using these values provides improvements in speech recognition accuracy
over a system without enhancement [10].