Digital Signal Processing Reference
Table 10.3  Mean isolated digit recognition rates (UA / WA, in %) for different noise
types, noise compensation strategies, and features; training on clean data

                        UA / WA [%]
Strategy   Features   Clean    CAR      BAB      WGN
SLDM       MFCC       99.92    99.52    99.29    87.79
HEQ        MFCC       99.92    98.21    96.53    77.50
CMS        PLP        99.84    97.70    97.92    72.67
MVN        MFCC       99.84    94.86    93.32    79.06
CMS        MFCC       99.84    96.96    97.18    72.22
HEQ        PLP        99.92    97.20    95.27    66.51
HCRF/CMS   MFCC       99.76    95.67    94.97    70.06
USS        MFCC       99.05    93.52    92.27    53.19
AFE        MFCC       100.0    87.85    92.84    64.14
None       PLP        99.92    81.06    90.58    67.72
None       MFCC       99.92    75.09    88.37    63.67
AR-SLDS    None       97.37    47.24    78.51    93.32
SAR-HMM    None       98.10    54.26    83.16    41.91
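The table reports a single figure per cell for UA / WA. As a reminder of how the two measures differ, the following minimal sketch computes both for a list of reference and predicted labels: WA is the overall fraction of correctly classified instances, while UA is the mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from the experiments.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all instances that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of the per-class recalls, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical toy labels (illustration only): the digit "1" is rare.
y_true = ["0", "0", "0", "0", "1", "1"]
y_pred = ["0", "0", "0", "0", "1", "0"]

wa = weighted_accuracy(y_true, y_pred)    # 5 of 6 instances correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 4/4 and 1/2
```

When every class has the same number of test instances, the two measures coincide, which would be consistent with one value per cell in the table.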
[...] sequences were used for global SLDM training. This captures the dynamics of clean
speech. The speech model consists of 32 hidden states, and the utterance-specific
noise model of a single Gaussian mixture component. It was trained on the first and
last 10 frames of the noisy test utterance. To speed up the calculation, the algorithm
for speech enhancement was run with history parameter r [...]. In
more demanding recognition tasks like the INTERSPEECH Consonant Challenge [23],
SLDM feature enhancement was proven to increase recognition rates for noisy
speech: The technique cannot compete with strategies that use perfect knowledge of
the local SNR of time-frequency components in the spectrogram, such as oracle masks
[24-26]; however, compared to the Consonant Challenge HMM baseline recogniser [23],
the SLDM approach can improve noisy speech recognition rates by up to 174 %
relative [27]. HCRF for the classification of features enhanced by CMS did not result
in a better recognition rate as compared to using HMM. For WGN disturbance, the
best recognition rate (93.3 % WA, averaged over the different SNR conditions) is
reached by AR-SLDS modelling (cf. Table 10.3). [...] is in this case modelled in the
time domain as an AR process. As explained in [...]
The AR-SLDS used in the experiment is based on a 10th-order SAR-HMM with ten
states. This concept is, however, not suited for low-pass noise at negative SNR levels
in these experiments: for the CAR noise type, AR-SLDS modelling reaches only 47.2 % WA,
averaged over all car types and driving conditions. A reason for this [...]
spectrum. For an HMM-based recogniser without feature enhancement, PLP
features perform slightly better than MFCC features.
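The utterance-specific noise model mentioned above (a single Gaussian trained on the first and last 10 frames of the noisy test utterance) can be illustrated with a minimal sketch. It assumes that the edge frames contain noise only and, for simplicity, that the Gaussian has a diagonal covariance; neither the function name nor the diagonal-covariance choice comes from the source, and the feature values below are made up.

```python
from statistics import fmean, pvariance


def estimate_noise_gaussian(frames, n_edge=10):
    """Fit one Gaussian (mean and diagonal variance) to the edge frames.

    frames: list of feature vectors (lists of floats), one per frame.
    n_edge: number of frames taken from each end of the utterance.
    """
    if len(frames) < 2 * n_edge:
        raise ValueError("utterance too short for the requested edge frames")
    # Assumption: the first and last n_edge frames contain no speech.
    edge = frames[:n_edge] + frames[-n_edge:]
    dims = list(zip(*edge))  # transpose: one tuple per feature dimension
    mean = [fmean(d) for d in dims]
    var = [pvariance(d) for d in dims]  # diagonal covariance (assumption)
    return mean, var


# Hypothetical 2-dimensional features: noise-only edges around a "speech" part.
frames = [[1.0, 2.0]] * 10 + [[100.0, 100.0]] * 30 + [[3.0, 2.0]] * 10
mean, var = estimate_noise_gaussian(frames)
```

Only the 20 edge frames enter the estimate, so the middle of the utterance, where speech is expected, does not influence the noise statistics.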