Digital Signal Processing Reference
In-Depth Information
Table 10.3 Mean isolated digit recognition rates (UA / WA, in %) for different noise types, noise compensation strategies, and features, training on clean data

Strategy    Features    Clean    CAR      BAB      WGN
SLDM        MFCC        99.92    99.52    99.29    87.79
HEQ         MFCC        99.92    98.21    96.53    77.50
CMS         PLP         99.84    97.70    97.92    72.67
MVN         MFCC        99.84    94.86    93.32    79.06
CMS         MFCC        99.84    96.96    97.18    72.22
HEQ         PLP         99.92    97.20    95.27    66.51
HCRF/CMS    MFCC        99.76    95.67    94.97    70.06
USS         MFCC        99.05    93.52    92.27    53.19
AFE         MFCC        100.0    87.85    92.84    64.14
None        PLP         99.92    81.06    90.58    67.72
None        MFCC        99.92    75.09    88.37    63.67
AR-SLDS     None        97.37    47.24    78.51    93.32
SAR-HMM     None        98.10    54.26    83.16    41.91
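
The table reports a single figure for UA / WA. As a minimal sketch, assuming UA and WA denote unweighted accuracy (mean of the per-class recalls) and weighted accuracy (overall fraction of correct decisions), as is common usage, the two measures can be computed as follows. The function name `ua_wa` and the toy labels are illustrative and not taken from the source.

```python
from collections import Counter


def ua_wa(reference, predicted):
    """Return (UA, WA) in percent for two equal-length label sequences."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty label sequences of equal length")

    total_per_class = Counter(reference)
    correct_per_class = Counter(
        r for r, p in zip(reference, predicted) if r == p
    )

    # WA: share of all instances recognised correctly, so frequent
    # classes contribute more than rare ones.
    wa = 100.0 * sum(correct_per_class.values()) / len(reference)

    # UA: mean of the per-class recalls, so every class counts equally.
    recalls = [correct_per_class[c] / n for c, n in total_per_class.items()]
    ua = 100.0 * sum(recalls) / len(recalls)
    return ua, wa


# Toy example with unbalanced classes (hypothetical data).
reference = ["one"] * 4 + ["two"] * 4 + ["three"] * 2
predicted = ["one", "one", "one", "one",
             "two", "two", "two", "one",
             "three", "two"]
ua, wa = ua_wa(reference, predicted)
print(f"UA = {ua:.2f} %, WA = {wa:.2f} %")
```

When every class occurs equally often in the test set, the two measures coincide, so a single number suffices in that case.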
SLDM for speech and an LDM for noise (cf. Sect. 9.2.2). All clean training
sequences were used for global SLDM training, which captures the dynamics of clean
speech. The speech model consists of 32 hidden states, and the utterance-specific
noise model of a single Gaussian mixture component. The noise model was trained on the first and
last 10 frames of the noisy test utterance. To speed up the calculation, the algorithm
for speech enhancement was run with history parameter r = 1 (cf. Sect. 9.2.2.4). For
more demanding recognition tasks like the INTERSPEECH Consonant Challenge
[23], SLDM feature enhancement has been shown to increase recognition rates for noisy
speech. The technique cannot compete with strategies that use perfect knowledge of
the local SNR of time-frequency components in the spectrogram, such as oracle masks
[24-26]. However, compared to the Consonant Challenge HMM baseline recogniser
[23], the SLDM approach can improve noisy speech recognition rates by up to 174 %
relative [27]. HCRF for the classification of features enhanced by CMS did not result
in a better recognition rate than using HMM. For WGN disturbance, the
best recognition rate (93.3 % WA, averaged over the different SNR conditions) is
reached by the AR-SLDS (cf. Sect. 9.3.3.2), which models the noisy speech signal
in the time domain as an AR process and constitutes the fusion of the SAR-HMM with the SLDS.
The AR-SLDS used in the experiment is based on a 10th order SAR-HMM with ten
states. This concept is, however, not suited for low-pass noise at negative SNR levels
in these experiments: for the CAR noise type, only 47.2 % WA is reached with AR-SLDS
modelling, averaged over all car types and driving conditions. A reason for this
is the assumption made in Eq. (9.29): additive noise is expected to have a flat
spectrum. For an HMM-based recogniser without feature enhancement, PLP
features perform slightly better than MFCC features in all three noise conditions.
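
The utterance-specific noise model described above (a single Gaussian trained on the first and last 10 frames of the noisy test utterance) can be illustrated with a short sketch. It assumes that these edge frames contain noise only, that each frame is a plain list of feature values, and that a diagonal covariance is sufficient. None of these details are specified in the text, and the function name `estimate_noise_gaussian` is hypothetical.

```python
from statistics import fmean, pvariance


def estimate_noise_gaussian(frames, n_edge=10):
    """Fit one diagonal Gaussian to the first and last n_edge frames.

    frames: list of feature vectors (lists of floats), one per frame.
    Returns (mean, variance), each a list with one entry per dimension.
    """
    if n_edge < 1 or len(frames) < 2 * n_edge:
        raise ValueError("utterance too short for the requested edge size")

    # Assumption: the utterance edges contain noise only.
    edge = frames[:n_edge] + frames[-n_edge:]

    # Transpose so that each tuple holds one feature dimension.
    dims = list(zip(*edge))
    mean = [fmean(d) for d in dims]
    var = [pvariance(d, mu) for d, mu in zip(dims, mean)]
    return mean, var


# Toy utterance (hypothetical data): noisy edges, "speech" in the middle.
edge_part = [[1.0, 2.0] if i % 2 == 0 else [3.0, 2.0] for i in range(10)]
speech_part = [[100.0, 100.0] for _ in range(10)]
utterance = edge_part + speech_part + edge_part

noise_mean, noise_var = estimate_noise_gaussian(utterance)
print(noise_mean, noise_var)
```

The middle frames do not influence the estimate, which is the point of restricting training to the utterance edges.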