Figure 6. Experimental sequences the subject is guided through for the recordings (Intro, ca. 1.5 min; ES-1 through ES-6, ca. 3 to 5 min each; each sequence is marked with its intended position in PAD space).
challenge on the word level, a video challenge on the frame level, and an audiovisual challenge, also on the video frame level.
The data was recorded in a human-computer interaction scenario
in which the subjects were instructed to interact with an affectively
colored artificial agent. Audio and video material was collected from
13 different subjects in 63 recordings overall. The recorded data was
labeled in four affective dimensions: arousal, expectancy, power, and
valence. Every recording was annotated by two to eight raters. The
raters' annotations were averaged for each dimension, resulting in a
real value for each time step; subsequently, the labels were binarized
using a threshold equal to the grand mean of each dimension. Along
with the sensor data and the annotations, a word-by-word transcription
of the spoken language was provided, which partitions the dialog
into conversational turns. For the evaluation of the challenge, only
arousal was taken into account, as classification of the other dimensions
yielded poor results.¹ (See Schuller et al., 2011, and McKeown et al.,
2010, for a detailed description of the data set.)
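As a minimal sketch of this labeling scheme in Python (the array layout and names are illustrative assumptions, not the challenge tooling):

import numpy as np

def binarize_dimension(annotations):
    """Average rater annotations for one affective dimension and
    binarize the averaged track at the dimension's grand mean.

    annotations: array of shape (n_raters, n_time_steps) holding
    the continuous ratings, e.g. for arousal.
    """
    # Average across raters: one real value per time step.
    mean_track = annotations.mean(axis=0)
    # Grand-mean threshold (computed here over this track alone;
    # the challenge takes the grand mean of the whole dimension).
    threshold = mean_track.mean()
    # Binary label per time step.
    return (mean_track > threshold).astype(int)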
3.2 Features
In the following, the proposed method for classification from
multiple sources is described. We begin with a description of the
individual modalities and the features extracted from each.
3.2.1 Audio Features
From the audio signal, the following features are extracted:
• The fundamental frequency values are extracted using the f0 tracker
available in the ESPS/waves+² software package. Besides the f0
track, the energy and the linear predictive coding (LPC) coefficients
of the plain wave signal are extracted (Hermansky, 1990); a rough
extraction sketch is given after the footnotes below. All three
¹ http://sspnet.eu/avec2011/
² http://www.speech.kth.se/software/
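A rough Python sketch of comparable feature extraction, using librosa as a stand-in for the ESPS/waves+ tools (the pYIN tracker, frame sizes, and LPC order are assumptions, not the original setup):

import numpy as np
import librosa

# Load a recording (the path is hypothetical).
y, sr = librosa.load("recording.wav", sr=16000)

# Fundamental frequency track; pYIN stands in for the ESPS/waves+
# f0 tracker. Unvoiced frames are returned as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=50.0, fmax=500.0, sr=sr)

# Frame-wise energy of the plain wave signal (root mean square).
energy = librosa.feature.rms(y=y)[0]

# Frame-wise LPC coefficients (order 12 is an assumed setting).
frames = librosa.util.frame(y, frame_length=2048, hop_length=512)
lpc = np.stack([librosa.lpc(np.ascontiguousarray(f), order=12)
                for f in frames.T])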
 