Applications in Intelligent Speech Analysis - Intelligent Audio Analysis


Table 10.3 Mean isolated digit recognition rates in [%] UA / WA for different noise types, noise compensation strategies, and features, training on clean data

                      UA / WA [%]
Strategy   Features   Clean   CAR     BAB     WGN
SLDM       MFCC       99.92   99.52   99.29   87.79
HEQ        MFCC       99.92   98.21   96.53   77.50
CMS        PLP        99.84   97.70   97.92   72.67
MVN        MFCC       99.84   94.86   93.32   79.06
CMS        MFCC       99.84   96.96   97.18   72.22
HEQ        PLP        99.92   97.20   95.27   66.51
HCRF/CMS   MFCC       99.76   95.67   94.97   70.06
USS        MFCC       99.05   93.52   92.27   53.19
AFE        MFCC       100.0   87.85   92.84   64.14
None       PLP        99.92   81.06   90.58   67.72
None       MFCC       99.92   75.09   88.37   63.67
AR-SLDS    None       97.37   47.24   78.51   93.32
SAR-HMM    None       98.10   54.26   83.16   41.91
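The two measures named in the table caption can be computed from reference and predicted labels. The sketch below assumes the usual definitions: WA (weighted accuracy) is the overall fraction of correctly recognised instances, and UA (unweighted accuracy) is the mean of the per-class recalls. The function name `ua_wa` and the toy labels are illustrative only and are not taken from the experiments.

```python
from collections import Counter

def ua_wa(reference, predicted):
    """Return (UA, WA) in percent for two equal-length label sequences."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    totals = Counter(reference)  # instances per reference class
    hits = Counter(r for r, p in zip(reference, predicted) if r == p)
    # WA: correct decisions over all instances
    wa = 100.0 * sum(hits.values()) / len(reference)
    # UA: recall per class, averaged with equal weight per class
    ua = 100.0 * sum(hits[c] / totals[c] for c in totals) / len(totals)
    return ua, wa

# Toy example with hypothetical labels (three "one" instances, one "two")
reference = ["one", "one", "one", "two"]
predicted = ["one", "one", "two", "two"]
ua, wa = ua_wa(reference, predicted)
```

When every class occurs equally often in the test set, the two measures coincide.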

SLDM for speech and an LDM for noise (cf. Sect. 9.2.2). All clean training sequences were used for global SLDM training, which captures the dynamics of clean speech. The speech model consists of 32 hidden states; the utterance-specific noise model consists of a single Gaussian mixture component and was trained on the first and last 10 frames of the noisy test utterance. To speed up the calculation, the algorithm for speech enhancement was run with history parameter r = 1 (cf. Sect. 9.2.2.4).

For more demanding recognition tasks such as the INTERSPEECH Consonant Challenge [23], SLDM feature enhancement has been shown to increase recognition rates for noisy speech. The technique cannot compete with strategies that use perfect knowledge of the local SNR of time-frequency components in the spectrogram, such as oracle masks [24-26]. Compared to the Consonant Challenge HMM baseline recogniser [23], however, the SLDM approach can improve noisy speech recognition rates by up to 174 % relative [27].

Using HCRF to classify features enhanced by CMS did not result in a better recognition rate than using HMM. For WGN disturbance, the best recognition rate (93.3 % WA, averaged over the different SNR conditions) is reached by the AR-SLDS introduced in Sect. 9.3.3.2. In this case, the noisy speech signal is modelled in the time domain as an AR process. As explained in Sect. 9.3.3.2, the AR-SLDS constitutes the fusion of the SAR-HMM with the SLDS.
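To make the time-domain AR modelling concrete, the sketch below implements the generic AR relation x[t] = a_1 x[t-1] + ... + a_p x[t-p] + e[t], in which each sample is a linear combination of the p previous samples plus an innovation term. It is a plain AR illustration under that assumption, not the AR-SLDS or SAR-HMM of Sect. 9.3.3.2; the function names and coefficient values are hypothetical.

```python
import random

def ar_predict(history, coeffs):
    """One-step AR prediction: sum over k of coeffs[k-1] * x[t-k]."""
    # history[-1] is x[t-1], history[-2] is x[t-2], and so on.
    return sum(a * history[-k] for k, a in enumerate(coeffs, start=1))

def ar_simulate(coeffs, n_samples, noise_std=1.0, seed=0):
    """Draw n_samples from an AR(p) process with white Gaussian innovations."""
    rng = random.Random(seed)
    p = len(coeffs)
    x = [0.0] * p  # zero initial conditions (an assumption of this sketch)
    for _ in range(n_samples):
        x.append(ar_predict(x, coeffs) + rng.gauss(0.0, noise_std))
    return x[p:]  # drop the artificial initial conditions

def ar_residuals(signal, coeffs):
    """Prediction errors e[t] = x[t] - prediction, for t >= p."""
    p = len(coeffs)
    return [signal[t] - ar_predict(signal[:t], coeffs)
            for t in range(p, len(signal))]

coeffs = [0.5, -0.25]  # hypothetical stable AR(2) coefficients
signal = ar_simulate(coeffs, n_samples=200, noise_std=0.1)
residuals = ar_residuals(signal, coeffs)
```

With the true coefficients, the residuals recover the innovation sequence, which in this sketch is white Gaussian noise.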

The AR-SLDS used in the experiment is based on a 10th-order SAR-HMM with ten states. This concept is, however, not suited to low-pass noise at negative SNR levels in these experiments: for the CAR noise type, AR-SLDS modelling reaches only 47.2 % WA, averaged over all car types and driving conditions. One reason for this is the assumption made in Eq. (9.29) that additive noise has a flat spectrum.

For an HMM-based recogniser without feature enhancement, PLP features perform slightly better than MFCC features.
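The utterance-specific noise model described earlier in this section, a single Gaussian trained on the first and last 10 frames of the noisy test utterance, can be sketched as follows. The sketch assumes that those edge frames contain noise only and uses a diagonal covariance; both are simplifying assumptions of this illustration, and the function name `estimate_noise_gaussian` and the feature layout are hypothetical.

```python
from statistics import fmean, pvariance

def estimate_noise_gaussian(frames, n_edge=10):
    """Fit a diagonal Gaussian to the first and last n_edge feature frames.

    frames: list of equal-length feature vectors, one per time frame.
    Returns (mean, variance), each with one value per feature dimension.
    """
    if n_edge < 1 or len(frames) < 2 * n_edge:
        raise ValueError("utterance too short for the requested edge frames")
    edge = frames[:n_edge] + frames[-n_edge:]
    dims = list(zip(*edge))  # transpose: one tuple of values per dimension
    mean = [fmean(d) for d in dims]
    var = [pvariance(d, mu) for d, mu in zip(dims, mean)]
    return mean, var

# Toy utterance (hypothetical values): noise-only edges around a "speech" middle
frames = [[1.0, 0.0]] * 10 + [[5.0, 5.0]] * 30 + [[3.0, 0.0]] * 10
noise_mean, noise_var = estimate_noise_gaussian(frames)
```

Only the 20 edge frames enter the estimate, so the middle part of the toy utterance has no influence on the resulting noise statistics.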
