Digital Signal Processing Reference
For each observation, 39-dimensional MFCC feature vectors were generated,
consisting of 13 MFCCs (including $C_0$) plus 13 delta and 13 acceleration
coefficients. Cepstral mean subtraction was applied to each feature. The elements
of the Jacobian were derived from this feature representation as per Eq. 10.7.
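The layout of the 39-dimensional vector can be illustrated with a minimal sketch. It assumes the 13 static MFCCs per frame are already available, uses a regression-style delta over a window of two frames on either side with edge frames replicated (an assumption; the text does not state the delta window), and applies mean subtraction to every dimension of the stacked vector. The function names `deltas`, `mean_subtract`, and `build_features` are illustrative only.

```python
def deltas(frames, window=2):
    """Regression-style deltas over +/- `window` frames (window size is an assumption).

    Frames beyond either end of the utterance are replaced by the nearest edge frame.
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def mean_subtract(frames):
    """Subtract the per-utterance mean from every feature dimension."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


def build_features(static):
    """Stack 13 static + 13 delta + 13 acceleration values, then mean-subtract."""
    d1 = deltas(static)   # delta coefficients
    d2 = deltas(d1)       # acceleration coefficients (deltas of deltas)
    stacked = [s + a + b for s, a, b in zip(static, d1, d2)]
    return mean_subtract(stacked)


# Toy input: 6 frames of 13 static coefficients (placeholder values, not real MFCCs).
static = [[float(t + c) for c in range(13)] for t in range(6)]
features = build_features(static)
print(len(features), len(features[0]))
```

Applying the mean subtraction after stacking is one reading of "applied to each feature"; subtracting the mean from the static coefficients only would leave the delta and acceleration values unchanged, since a constant offset cancels in the differences.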
The recognition task uses an open word loop grammar [12]; therefore, no
restrictions are made to ensure that exactly ten digits are recognized.
All speech recognition results quoted in this chapter are word accuracies (in %)
and are calculated as

$$\mathrm{Accuracy} = \frac{N - D - S - I}{N} \times 100, \tag{10.8}$$

where $N$ represents the total number of words, $D$ the number of deletions, $S$
the number of substitutions, and $I$ the number of insertions [13].
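As a quick check of Eq. 10.8, the calculation can be written as a small function. This is a minimal sketch; the function name `word_accuracy` and the error counts in the example are illustrative and are not results from the chapter.

```python
def word_accuracy(n_words, deletions, substitutions, insertions):
    """Word accuracy in percent, following Eq. 10.8."""
    if n_words <= 0:
        raise ValueError("n_words must be positive")
    return 100.0 * (n_words - deletions - substitutions - insertions) / n_words


# Illustrative counts only: 1000 reference words, 10 deletions,
# 20 substitutions, 5 insertions.
print(word_accuracy(1000, 10, 20, 5))  # 96.5
```

Because insertions are counted as errors, the value can fall below zero when the recognizer inserts many extra words, which is possible when the grammar places no limit on the number of recognized digits.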
10.3.3 Optimization Iterations
Since LIMA is posed as an optimization problem, over-optimization of the
enhancement parameters to a specific noise condition, speaker, or subset of
acoustic state models is highly possible and should be avoided. This suggests that
the number of optimization iterations should be kept small to maintain generality
across conditions, although too few iterations may leave the LIMA framework less
effective than a standard enhancement system. Real-time operation (another
important consideration for in-car ASR) also favors a limited number of iterations.
To address this issue, two experiments were designed to determine a suitable
balance between ASR performance and pseudo real-time operation using the
noise-only calibration framework described in Sect. 10.3.4. This framework was
chosen because, with speaker-independent acoustic models in use, noise conditions
were believed to have a greater effect on the resulting enhancement parameters
than individual speakers.
In the first experiment, the number of gradient-descent iterations was varied
whilst using a single joint optimization iteration (i.e., one full cycle of
recognition and parameter optimization). The second experiment varied the number
of joint optimization iterations whilst the number of gradient-descent iterations
(determined from the first experiment) was kept constant. The combined outcomes of
these experiments dictated the levels of optimization used for assessing the
frameworks detailed in Sect. 10.3.4.
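The two iteration counts varied in these experiments correspond to a nested loop, which the following sketch illustrates. It is not the LIMA implementation: `joint_optimize`, `toy_recognize`, and `toy_gradient` are hypothetical names, the recognition pass is replaced by a stand-in that returns a fixed target, and the objective is a simple squared error chosen only so that the example runs. The 26 parameters set to 1 mirror the initialization used in these experiments.

```python
def joint_optimize(params, recognize, gradient, n_joint, n_gd, step):
    """Run n_joint outer cycles, each with one recognition pass and n_gd descent steps."""
    params = list(params)  # work on a copy so the caller's list is untouched
    for _ in range(n_joint):
        # Outer loop: one full recognition pass with the current parameters.
        alignment = recognize(params)
        for _ in range(n_gd):
            # Inner loop: gradient-descent update with the recognition output held fixed.
            grad = gradient(params, alignment)
            params = [p - step * g for p, g in zip(params, grad)]
    return params


# Toy stand-ins (assumptions for illustration only).
calls = {"recognize": 0}


def toy_recognize(params):
    """Pretend recognition pass: counts calls and returns a fixed target vector."""
    calls["recognize"] += 1
    return [0.5] * len(params)


def toy_gradient(params, target):
    """Gradient of the squared error between the parameters and the target."""
    return [2.0 * (p - t) for p, t in zip(params, target)]


initial = [1.0] * 26  # one enhancement parameter per Mel-filterbank
tuned = joint_optimize(initial, toy_recognize, toy_gradient,
                       n_joint=1, n_gd=5, step=0.1)
print(calls["recognize"], round(tuned[0], 5))
```

In this arrangement the first experiment corresponds to varying `n_gd` with `n_joint=1`, and the second to varying `n_joint` with `n_gd` fixed. The cost of each outer cycle is dominated by the recognition pass, which is why the number of joint iterations matters for near real-time operation.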
For all experiments, the enhancement parameters were initialized to $a(k) = 1$ for
all 26 Mel-filterbanks. These values were an appropriate initial guess since standard
MFNS using these values provides improvements in speech recognition accuracy
over a system without enhancement [10].