Multi-Modal Classifier-Fusion for the Recognition of Emotions - Coverbal Synchrony in Human-Machine Interaction

Table 2. Accuracies of every multi-modal combination. Results in percent with standard deviation.

Combination                               Accuracy
Audio (1) + Video (1)                     62.0% (15%)
Video (1) + Physiology (1)                59.7% (13.4%)
Audio (1) + Physiology (1)                60.2% (9.3%)
Video (1) + Audio (1) + Physiology (1)    60.8% (9.1%)
Video (1) + Audio (2) + Physiology (3)    61.5% (12.6%)

3.3.2 AVEC 2011 Data Collection

For each label dimension and for each audio feature, a bag of
hidden Markov models (HMMs) has been trained (Breiman, 1996;
Rabiner, 1989). The number of hidden states and the number of mixture
components of the HMM have been optimized using a parameter
search, resulting in the selection of three hidden states and two
mixture components in the Gaussian mixture model (GMM) with
full covariance matrices.

The evaluation of the optimization process further indicated that
some features are not suited to detecting certain labels. Only the
label arousal could draw information from all features; expectancy
and power performed better using only the energy, fundamental
frequency and MFCC features, and the label valence favored only the
MFCC features. For each label, the log-likelihoods of every HMM
trained on the selected features are summed. To obtain more
robust models, we additionally used five times as many
models per class and summed their outputs as well.
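The summation described above can be illustrated with a minimal sketch. It assumes that each class of a label dimension has one bag of HMMs per selected feature and that each HMM has already produced a log-likelihood for the observed sequence. The class names ("high", "low"), the feature names and the numbers are hypothetical and serve only as an illustration; HMM training and scoring are not shown.

```python
from typing import Dict, List

# scores[class_name][feature_name] -> log-likelihoods of the bagged HMMs
# for that class and feature (all names here are hypothetical).
Scores = Dict[str, Dict[str, List[float]]]


def fuse_log_likelihoods(scores: Scores) -> Dict[str, float]:
    """Sum the log-likelihoods over all features and all bagged models."""
    return {
        cls: sum(sum(model_scores) for model_scores in per_feature.values())
        for cls, per_feature in scores.items()
    }


def decide(scores: Scores) -> str:
    """Return the class with the largest summed log-likelihood."""
    fused = fuse_log_likelihoods(scores)
    return max(fused, key=fused.get)


example = {
    "high": {"energy": [-10.0, -11.0], "mfcc": [-20.0, -19.5]},
    "low": {"energy": [-12.0, -12.5], "mfcc": [-18.0, -18.5]},
}
print(fuse_log_likelihoods(example))
print(decide(example))
```

Summing log-likelihoods is equivalent to multiplying the likelihoods themselves, so every model and feature contributes to the decision while the numerical underflow of a direct product is avoided.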

Furthermore, we assumed that the labels change only slowly over
time. We therefore conducted the classification on a per-turn
basis, collecting the detections within one turn and multiplying
their likelihoods to obtain more robust detections. A schema visualizing
the applied fusion architecture is shown in Figure 9. The results of
this approach are reported in Table 3.
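The turn-level combination can be sketched as follows, assuming that every detection within a turn provides a strictly positive likelihood for each class. The function name, the class names and the example values are hypothetical. The product is computed in the log domain, which gives the same ranking as multiplying the likelihoods directly.

```python
import math
from typing import Dict, List


def classify_turn(detections: List[Dict[str, float]]) -> str:
    """Fuse all detections of one turn by multiplying class likelihoods.

    Each detection maps a class name to a likelihood > 0. The product over
    the turn is evaluated as a sum of logarithms to avoid underflow.
    """
    if not detections:
        raise ValueError("a turn must contain at least one detection")
    classes = list(detections[0].keys())
    log_product = {
        cls: sum(math.log(det[cls]) for det in detections) for cls in classes
    }
    return max(log_product, key=log_product.get)


turn = [
    {"high": 0.6, "low": 0.4},
    {"high": 0.3, "low": 0.7},
    {"high": 0.8, "low": 0.2},
]
print(classify_turn(turn))
```

In this example the second detection alone would favor "low", but the product over the whole turn (0.144 versus 0.056) favors "high", which is the smoothing effect the slow-change assumption is meant to provide.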

Within the video challenge, the ν-SVM was employed as base
classifier (Schölkopf et al., 2000). The implementation was taken from
the well-known LibSVM repository. We concatenated 300 form and 300
motion features and used them to train a ν-SVM using a linear kernel
and probabilistic outputs according to Platt (1999). Due to memory
constraints, only 10,000 randomly drawn samples were used.
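The data preparation described above can be sketched as follows. This is only an illustration of the feature concatenation and the random subsampling; the SVM training itself is not shown. The function names, the fixed seed and the choice of sampling without replacement are assumptions, since the text does not specify them.

```python
import random
from typing import List, Sequence, Tuple


def concat_features(form: Sequence[float], motion: Sequence[float]) -> List[float]:
    """Concatenate the form and motion features of one sample."""
    return list(form) + list(motion)


def subsample(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    max_samples: int = 10_000,
    seed: int = 0,  # assumption: fixed seed, used only for reproducibility
) -> Tuple[List[Sequence[float]], List[int]]:
    """Draw at most max_samples samples at random, without replacement."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    if len(samples) <= max_samples:
        return list(samples), list(labels)
    rng = random.Random(seed)
    indices = sorted(rng.sample(range(len(samples)), max_samples))
    return [samples[i] for i in indices], [labels[i] for i in indices]


# 300 form + 300 motion features give a 600-dimensional vector per sample.
vector = concat_features([0.0] * 300, [1.0] * 300)
print(len(vector))
```

The index list is sorted only so that the drawn samples keep their original order; this has no effect on which samples are selected.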

Again, a parameter search was applied to obtain suitable parameters,
resulting in ν = 0.3 for arousal and power and ν = 0.7 for
expectancy and valence. Based on the results of all label dimensions,
