Towards Multimodal Driver’s Stress Detection - Digital Signal Processing for In-Vehicle Systems and Safety


where $m$ is the index of the Gaussian mixture component, $M$ is the total number of mixtures, $c_{jm}$ is the mixture weight such that

\[
\sum_{m=1}^{M} c_{jm} = 1, \tag{1.3}
\]

$n$ is the dimension of $o_t$, $\Sigma_{jm}$ is the mixture covariance matrix, and $\mu_{jm}$ is the mixture mean vector. The GMM representing neutral speech was trained on the passenger

conversations and the stressed speech GMM on joint Tell-Me and AA conversations

from the training set. In the neutral/stress classification task, the winning model is

selected using a maximum likelihood criterion:

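\[
j_{\mathrm{win}} =
\begin{cases}
1, & \sum_{t=1}^{T} \bigl( \log b_1(o_t) - \log b_2(o_t) \bigr) \ge Th, \\
2, & \sum_{t=1}^{T} \bigl( \log b_1(o_t) - \log b_2(o_t) \bigr) < Th,
\end{cases}
\tag{1.4}
\]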

where t is the time frame index, T is the total number of frames in the classified

utterance, and Th is the decision threshold.
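The decision rule can be illustrated with a minimal sketch in plain Python. It assumes diagonal-covariance GMMs and assumes that model 1 wins when the summed log-likelihood difference reaches the threshold, with model 2 winning otherwise. The function names, the `(weights, means, variances)` tuple layout, and the toy model parameters are hypothetical and serve only as an illustration; they are not taken from the chapter.

```python
import math

def log_gmm_density(o, weights, means, variances):
    """Log of b_j(o) for one frame o under a diagonal-covariance GMM."""
    log_terms = []
    for c, mu, var in zip(weights, means, variances):
        # A diagonal Gaussian factorizes into independent 1-D densities,
        # so its log density is a sum over the feature dimensions.
        log_gauss = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
            for x, m, v in zip(o, mu, var)
        )
        log_terms.append(math.log(c) + log_gauss)
    # Log-sum-exp over the mixture components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def classify_utterance(frames, gmm_1, gmm_2, threshold=0.0):
    """Return 1 if the summed log-likelihood difference reaches the threshold, else 2."""
    score = sum(
        log_gmm_density(o, *gmm_1) - log_gmm_density(o, *gmm_2)
        for o in frames
    )
    return 1 if score >= threshold else 2

# Toy 2-D models with made-up parameters: (weights, means, variances).
gmm_1 = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
gmm_2 = ([0.5, 0.5], [[4.0, 4.0], [5.0, 5.0]], [[1.0, 1.0], [1.0, 1.0]])
frames_near_1 = [[0.2, -0.1], [0.9, 1.1], [0.5, 0.4]]
frames_near_2 = [[4.5, 4.4], [5.1, 4.9], [3.8, 4.2]]

print(classify_utterance(frames_near_1, gmm_1, gmm_2))
print(classify_utterance(frames_near_2, gmm_1, gmm_2))
```

Summing per-frame log-likelihood differences treats the frames as independent, so the utterance-level decision depends only on the accumulated score and the threshold.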

In our experiments, the frame length was set to 25 ms, the skip rate to 10 ms, and the decision threshold to a fixed value Th = 0. Depending on the feature extraction

scheme, the GMMs comprise 32-64 mixtures, and only diagonals are calculated in

the covariance matrices. Unless otherwise specified, $c_0$ to $c_{12}$ form the static observation feature vector. In all evaluation setups, delta and acceleration coefficients are

extracted from the static coefficients and complete the feature vector. A variety of

features, including Mel Frequency Cepstral Coefficients (MFCC), are considered.
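As a rough illustration of this setup, the sketch below computes frame positions for a 25 ms window with a 10 ms skip and appends delta and acceleration coefficients to 13 static coefficients, giving a 39-dimensional vector. The 16 kHz sampling rate in the usage lines, the regression-based delta formula with a window of 2 frames, and all function names are assumptions made for the example; the chapter does not specify them.

```python
def frame_starts(num_samples, sample_rate, frame_ms=25, skip_ms=10):
    """Return the start index of every full frame and the frame length in samples."""
    frame_len = sample_rate * frame_ms // 1000
    skip = sample_rate * skip_ms // 1000
    if num_samples < frame_len:
        return [], frame_len
    return list(range(0, num_samples - frame_len + 1, skip)), frame_len

def deltas(features, window=2):
    """Regression-based deltas (an assumed formula); edge frames are replicated."""
    num_frames = len(features)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(num_frames):
        row = []
        for k in range(len(features[t])):
            num = sum(
                n * (features[min(t + n, num_frames - 1)][k]
                     - features[max(t - n, 0)][k])
                for n in range(1, window + 1)
            )
            row.append(num / denom)
        out.append(row)
    return out

def add_dynamic_features(static):
    """Append delta and acceleration (delta of delta) coefficients to each frame."""
    d = deltas(static)
    a = deltas(d)
    return [s + dd + aa for s, dd, aa in zip(static, d, a)]

# One second of audio at an assumed 16 kHz sampling rate.
starts, frame_len = frame_starts(16000, 16000)
print(len(starts), frame_len)

# Ten frames of 13 made-up static coefficients (stand-ins for c_0 to c_12).
static = [[float(t)] * 13 for t in range(10)]
full = add_dynamic_features(static)
print(len(full[0]))
```

Replicating the first and last frames at the edges keeps the number of frames unchanged, which is one common convention among several.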

In the UTDrive sessions, the amount of neutral spontaneous conversation data

considerably exceeds the number of Tell-Me and AA samples. In this case, misclassifying the small number of stressed samples would have little effect on the overall classification accuracy, while correctly classifying only the neutral data would ensure high overall accuracy. To eliminate the impact of different sizes of the

neutral and stressed sets, and to allow for accuracy-based selection of the optimal

front-end for both AA and Tell-Me conversation scenarios, the overall classification

accuracy is determined as

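\[
\mathrm{Acc} = \frac{2\,\mathrm{Acc}_{\text{N-N}} + \mathrm{Acc}_{\text{TellMe-S}} + \mathrm{Acc}_{\text{AA-S}}}{4} \; (\%), \tag{1.5}
\]

where $\mathrm{Acc}_{\text{N-N}}$ is the accuracy of neutral samples being classified as neutral, $\mathrm{Acc}_{\text{TellMe-S}}$ is the accuracy of Tell-Me samples being classified as stressed, and $\mathrm{Acc}_{\text{AA-S}}$ is the accuracy of AA samples being classified as stressed.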
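A small worked example shows why this weighting matters. The sketch assumes that the overall accuracy is the average of the three per-class accuracies with the neutral accuracy counted twice and the sum divided by 4. The sample counts and function names are made up for illustration and do not come from the UTDrive data.

```python
def overall_accuracy(acc_nn, acc_tellme_s, acc_aa_s):
    """Weighted accuracy in percent: neutral counts twice, each stressed set once."""
    return (2.0 * acc_nn + acc_tellme_s + acc_aa_s) / 4.0

def pooled_accuracy(correct_counts, total_counts):
    """Plain accuracy in percent over all samples pooled together."""
    return 100.0 * sum(correct_counts) / sum(total_counts)

# Hypothetical set sizes: 900 neutral, 50 Tell-Me, 50 AA samples.
totals = [900, 50, 50]
# A degenerate classifier that labels every sample as neutral.
correct = [900, 0, 0]

per_class = [100.0 * c / n for c, n in zip(correct, totals)]
print(pooled_accuracy(correct, totals))   # 90.0
print(overall_accuracy(*per_class))       # 50.0
```

With these counts, the degenerate classifier scores 90% on the pooled samples but only 50% under the weighted measure, which is the behavior the weighting is meant to expose.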

The efficiency of several feature extraction front-ends was evaluated in the neutral/stress classification task. In particular, Mel Frequency Cepstral Coefficients (MFCC) [19], Perceptual Linear Prediction (PLP) cepstral coefficients [20], Expolog cepstra [21],
