Table 2. Accuracies of every multi-modal combination. Results in percent with
standard deviation in parentheses.

Combination                                Accuracy
Audio (1) + Video (1)                      62.0% (15%)
Video (1) + Physiology (1)                 59.7% (13.4%)
Audio (1) + Physiology (1)                 60.2% (9.3%)
Video (1) + Audio (1) + Physiology (1)     60.8% (9.1%)
Video (1) + Audio (2) + Physiology (3)     61.5% (12.6%)

3.3.2 AVEC 2011 Data Collection
For each label dimension and for each audio feature, a bag of
hidden Markov models (HMMs) has been trained (Breiman, 1996;
Rabiner, 1989). The number of hidden states and the number of mixture
components of the HMMs have been optimized using a parameter
search, resulting in the selection of three hidden states and two
mixture components in the Gaussian mixture model (GMM) with
full covariance matrices.
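
For orientation, the emission density implied by this configuration can be written in the usual GMM-HMM form. The symbols below (b_j for the emission density of state j, c_{jm} for the mixture weights, \mu_{jm} and \Sigma_{jm} for the component means and full covariance matrices) are standard textbook notation and are an assumption here, since the text itself does not introduce any notation.

```latex
% Emission density of hidden state j (three states, two full-covariance components each)
b_j(\mathbf{x}) = \sum_{m=1}^{2} c_{jm}\,
    \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_{jm},\, \boldsymbol{\Sigma}_{jm}\right),
\qquad \sum_{m=1}^{2} c_{jm} = 1, \quad j \in \{1, 2, 3\}
```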
The evaluation of the optimization process further indicated that
some features appear to be inappropriate for detecting certain labels. It
turned out that only the label arousal can draw information from all
features, whereas expectancy and power performed better using only the
energy, fundamental frequency, and MFCC features. The label valence
favored only the MFCC features. For each label, the log-likelihoods
of every HMM, trained on the selected features, are summed. To obtain more
robust models, we additionally used five times as many
models per class and summed their outputs as well.
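
The summation described above can be illustrated with a minimal sketch. It assumes that each HMM has already produced a log-likelihood for the current observation sequence; the class names ("high", "low"), the feature names, and the function `classify_by_summed_loglik` are hypothetical and chosen only for illustration, not taken from the original system.

```python
from typing import Dict, List


def classify_by_summed_loglik(loglik: Dict[str, Dict[str, List[float]]]) -> str:
    """Pick the class whose HMMs yield the largest summed log-likelihood.

    loglik[class_name][feature_name] holds one log-likelihood per HMM in the
    bag trained for that class on that feature (hypothetical layout).
    """
    scores = {}
    for cls, per_feature in loglik.items():
        # Sum over all features and over all bagged models of each feature.
        scores[cls] = sum(sum(values) for values in per_feature.values())
    return max(scores, key=scores.get)


# Illustrative numbers only; two bagged models per class and feature.
example = {
    "high": {"energy": [-10.0, -11.0], "mfcc": [-20.0, -19.5]},
    "low": {"energy": [-12.0, -12.5], "mfcc": [-18.0, -18.5]},
}
decision = classify_by_summed_loglik(example)
```

Restricting a label to a subset of features, as described for expectancy, power, and valence, would in this sketch simply mean leaving the unused features out of the inner dictionaries.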
Furthermore, we assumed that the labels change
only slowly over time. We therefore conducted the classification on
a turn basis, collecting the detections within one turn and multiplying
the likelihoods to obtain more robust detections. A schema visualizing
the applied fusion architecture is shown in Figure 9. The results of
this approach are reported in Table 3.
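
A minimal sketch of the turn-level step follows. Multiplying likelihoods is done here by adding log-likelihoods, which is numerically safer and gives the same decision. The function name `fuse_turn`, the class names, and the numbers are assumptions made for illustration.

```python
from typing import Dict, List


def fuse_turn(detections: List[Dict[str, float]]) -> str:
    """Combine all detections of one turn into a single class decision.

    Each element maps a class name to the log-likelihood of one detection
    inside the turn. Adding logs is equivalent to multiplying likelihoods.
    """
    if not detections:
        raise ValueError("a turn needs at least one detection")
    totals: Dict[str, float] = {}
    for detection in detections:
        for cls, loglik in detection.items():
            totals[cls] = totals.get(cls, 0.0) + loglik
    return max(totals, key=totals.get)


# Three detections within one turn (illustrative values).
turn = [
    {"high": -2.0, "low": -1.5},
    {"high": -1.0, "low": -2.5},
    {"high": -1.2, "low": -1.3},
]
turn_label = fuse_turn(turn)
```

Note that the first detection on its own would favor "low"; pooling over the turn is what stabilizes the decision.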
Within the video challenge, the ν-SVM was employed as base
classifier (Schölkopf et al., 2000). The implementation was taken from
the well-known LibSVM repository. We concatenated 300 form and 300
motion features and used them to train a ν-SVM with a linear kernel
and probabilistic outputs according to Platt (1999). Due to memory
constraints, only 10,000 randomly drawn samples were used.
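
The data preparation and the probabilistic output can be sketched as follows, using only the Python standard library. This is not the LibSVM code path: the helper names, the fixed random seed, and the sigmoid parameters `a` and `b` are assumptions for illustration. The sigmoid itself follows the form proposed by Platt (1999), P(y = 1 | f) = 1 / (1 + exp(a f + b)), where f is the SVM decision value and a, b are fitted on held-out data.

```python
import math
import random
from typing import List, Sequence


def concat_features(form: Sequence[float], motion: Sequence[float]) -> List[float]:
    """Concatenate form and motion features into one vector (300 + 300 = 600)."""
    return list(form) + list(motion)


def subsample(samples: Sequence, k: int, seed: int = 0) -> list:
    """Draw at most k samples without replacement (the seed is an assumption)."""
    if len(samples) <= k:
        return list(samples)
    return random.Random(seed).sample(list(samples), k)


def platt_probability(decision_value: float, a: float, b: float) -> float:
    """Map an SVM decision value to a probability with Platt's sigmoid."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


vector = concat_features([0.0] * 300, [1.0] * 300)
training_pool = list(range(50000))          # stand-in for the full sample set
training_subset = subsample(training_pool, 10000)
p = platt_probability(2.0, a=-1.0, b=0.0)   # hypothetical a and b
```

In practice LibSVM fits the sigmoid parameters internally when probability estimates are requested, so the last function only illustrates the mapping.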
Again, a parameter search was applied to obtain suitable parameters,
resulting in ν = 0.3 for arousal and power and ν = 0.7 for
expectancy and valence. Based on the results of all label dimensions,