Digital Signal Processing Reference
where m is the index of the Gaussian mixture component, M is the total number of
mixtures, $c_{jm}$ is the mixture weight such that
$$\sum_{m=1}^{M} c_{jm} = 1, \qquad (1.3)$$
$n$ is the dimension of $o_t$, $\Sigma_{jm}$ is the mixture covariance matrix, and $\mu_{jm}$ is the mixture
mean vector. The GMM representing neutral speech was trained on the passenger
conversations and the stressed speech GMM on joint Tell-Me and AA conversations
from the training set. In the neutral/stress classification task, the winning model is
selected using a maximum likelihood criterion:
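$$
j_{\mathrm{win}} =
\begin{cases}
1, & \sum_{t=1}^{T} \log\bigl(b_1(o_t)\bigr) - \sum_{t=1}^{T} \log\bigl(b_2(o_t)\bigr) \geq \mathit{Th}, \\
2, & \sum_{t=1}^{T} \log\bigl(b_1(o_t)\bigr) - \sum_{t=1}^{T} \log\bigl(b_2(o_t)\bigr) < \mathit{Th},
\end{cases}
\qquad (1.4)
$$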
where t is the time frame index, T is the total number of frames in the classified
utterance, and Th is the decision threshold.
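The decision rule in Eq. (1.4) can be illustrated with a minimal sketch. It assumes the standard diagonal-covariance GMM density for $b_j(o_t)$; the function names, the `(weights, means, variances)` model layout, and all numeric values are hypothetical and chosen only for illustration, not taken from the experiments.

```python
import math

def log_gmm_diag(o, weights, means, variances):
    """Log-density of a diagonal-covariance GMM at observation o (assumed form of b_j)."""
    n = len(o)
    log_terms = []
    for c, mu, var in zip(weights, means, variances):
        # Log of one weighted Gaussian component with a diagonal covariance matrix
        log_det = sum(math.log(v) for v in var)
        maha = sum((x - m) ** 2 / v for x, m, v in zip(o, mu, var))
        log_terms.append(
            math.log(c) - 0.5 * (n * math.log(2 * math.pi) + log_det + maha)
        )
    # Log-sum-exp over the mixture components for numerical stability
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def classify(frames, model_1, model_2, th=0.0):
    """Return 1 if the summed log-likelihood difference is >= th, otherwise 2."""
    score = sum(
        log_gmm_diag(o, *model_1) - log_gmm_diag(o, *model_2) for o in frames
    )
    return 1 if score >= th else 2

# Toy two-dimensional models as (weights, means, variances); values are made up.
model_1 = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
model_2 = ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])
frames = [[0.2, -0.1], [0.9, 1.1], [0.5, 0.4]]
winner = classify(frames, model_1, model_2)
```

Summing per-frame log-likelihoods rather than multiplying likelihoods avoids numerical underflow on long utterances, which is one reason the criterion is usually stated in the log domain.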
In our experiments, the frame length was set to 25 ms, skip rate 10 ms, and the
decision threshold to a fixed value Th = 0. Depending on the feature extraction
scheme, the GMMs comprise 32-64 mixtures, and only diagonals are calculated in
the covariance matrices. Unless otherwise specified, $c_0$ to $c_{12}$ form the static
observation feature vector. In all evaluation setups, delta and acceleration coefficients are
extracted from the static coefficients and complete the feature vector. A variety of
features, including Mel Frequency Cepstral Coefficients (MFCC), are considered.
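As a rough illustration of how the feature vector is completed, the sketch below appends delta and acceleration coefficients to 13 static coefficients, giving 39 dimensions per frame. The text does not state which delta formula was used; the regression formula over a window of two frames on each side, the edge handling by frame repetition, and the names `deltas` and `full_vectors` are all assumptions.

```python
def deltas(features, window=2):
    """Regression-based deltas over +/- window frames; edge frames are repeated (assumed)."""
    num_frames = len(features)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(len(features[t])):
            acc = 0.0
            for k in range(1, window + 1):
                nxt = features[min(t + k, num_frames - 1)][d]
                prv = features[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out

def full_vectors(static):
    """Concatenate static, delta, and acceleration coefficients for every frame."""
    d1 = deltas(static)   # delta coefficients
    d2 = deltas(d1)       # acceleration coefficients (deltas of the deltas)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Toy input: 5 frames of 13 static coefficients with made-up values.
static = [[float(t)] * 13 for t in range(5)]
vectors = full_vectors(static)
```

With a 10 ms skip rate, each row of `static` would correspond to one 25 ms analysis frame taken every 10 ms.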
In the UTDrive sessions, the amount of neutral spontaneous conversation data
considerably exceeds the number of Tell-Me and AA samples. In this case, misclassifying
the small number of stressed samples would have little effect on the
overall classification accuracy, while correctly classifying only the neutral data would
still yield a high overall accuracy. To eliminate the impact of the different sizes of the
neutral and stressed sets, and to allow for accuracy-based selection of the optimal
front-end for both AA and Tell-Me conversation scenarios, the overall classification
accuracy is determined as
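$$
\mathrm{Acc} = \frac{2\,\mathrm{Acc}_{\text{N-N}} + \mathrm{Acc}_{\text{TellMe-S}} + \mathrm{Acc}_{\text{AA-S}}}{4} \ (\%), \qquad (1.5)
$$
where $\mathrm{Acc}_{\text{N-N}}$ is the accuracy of neutral samples being classified as neutral,
$\mathrm{Acc}_{\text{TellMe-S}}$ is the accuracy of Tell-Me samples being classified as stressed, and
$\mathrm{Acc}_{\text{AA-S}}$ is the accuracy of AA samples being classified as stressed.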
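The effect of the weighting in Eq. (1.5) can be seen in a small sketch that compares it with a pooled, sample-count-based accuracy. The sample counts and the function names are hypothetical and serve only to show why the pooled figure can be misleading when neutral data dominate.

```python
def overall_accuracy(acc_nn, acc_tellme_s, acc_aa_s):
    """Weighted accuracy of Eq. (1.5); all inputs and the result are in percent."""
    return (2 * acc_nn + acc_tellme_s + acc_aa_s) / 4

def pooled_accuracy(correct_counts, total_counts):
    """Plain accuracy over all samples pooled together, in percent."""
    return 100.0 * sum(correct_counts) / sum(total_counts)

# Made-up scenario: a classifier that labels every sample as neutral.
totals = [900, 50, 50]   # neutral, Tell-Me, AA sample counts (hypothetical)
correct = [900, 0, 0]    # only the neutral samples are classified correctly

pooled = pooled_accuracy(correct, totals)
weighted = overall_accuracy(100.0, 0.0, 0.0)
print(pooled, weighted)  # 90.0 50.0
```

In this made-up case the pooled accuracy looks high even though no stressed sample is detected, whereas the weighted measure gives the neutral class and the two stressed scenarios together equal influence.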
The efficiency of several feature extraction front-ends was evaluated in the neutral/stress
classification task. In particular, Mel Frequency Cepstral Coefficients (MFCC) [19],
Perceptual Linear Prediction (PLP) cepstral coefficients [20], Expolog cepstra [21],