W3C EmotionML (Emotion Markup Language) has been proposed
as a technology to represent and process emotion-related data and
to enable the interoperability of components dedicated to emotion-
oriented computing. A more recent attempt at a language that is not
limited to the representation of emotion-related data, but aims at
representing nonverbal behavior in general, has been made by Scherer
et al. (2012) with PML (Perception Markup Language). As in the case
of semantic fusion, the authors identified a specific need to represent
uncertainties in the interpretation of data. For example, a gaze away
from the interlocutor may signal a moment of high concentration, but
may also indicate disengagement.
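The following is a minimal, hypothetical sketch of how such annotations with
explicit uncertainty might be generated in Python, serializing emotion
categories with confidence values in the style of W3C EmotionML; the
vocabulary URI and category names are illustrative assumptions and should be
checked against the specification.

    import xml.etree.ElementTree as ET

    # Sketch: annotate one observation with two competing emotional
    # interpretations, each carrying a confidence value so that the
    # uncertainty of the interpretation is represented explicitly.
    NS = "http://www.w3.org/2009/10/emotionml"  # EmotionML namespace
    ET.register_namespace("", NS)

    root = ET.Element(f"{{{NS}}}emotionml", {"version": "1.0"})
    emotion = ET.SubElement(
        root, f"{{{NS}}}emotion",
        # Assumed category vocabulary; EmotionML expects a category-set URI.
        {"category-set": "http://www.w3.org/TR/emotion-voc/xml#everyday-categories"},
    )
    # Two readings of the same nonverbal signal, e.g. gaze aversion:
    ET.SubElement(emotion, f"{{{NS}}}category",
                  {"name": "interested", "confidence": "0.6"})
    ET.SubElement(emotion, f"{{{NS}}}category",
                  {"name": "bored", "confidence": "0.3"})

    print(ET.tostring(root, encoding="unicode"))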
4.2 Acted versus spontaneous signals
Most emotion recognition systems still rely exclusively on acted
data for which very promising results have been obtained. The way
emotions are expressed by actors may be called prototypical: independent
observers would largely agree on the emotional state of these speakers.
A common example is voice data from actors, for which developers of
emotion recognition systems have reported accuracy rates of over 80%
for seven emotion classes. In realistic applications,
there is, however, no guarantee that emotions are expressed in a
prototypical manner. As a consequence, these applications still
represent a great challenge for current emotion recognition systems,
and an obvious question is whether the recognition rates obtained
for unimodal non-acted data can be improved by considering multiple
modalities.
Unfortunately, the gain obtained by multimodal fusion seems to be
lower for non-acted than for acted data. Based on a comprehensive
analysis of state-of-the-art approaches to affect recognition, D'Mello
and Kory (2012) report an average improvement of 8.12% for multimodal
over unimodal affect recognition, while the improvement was
significantly higher for acted data (12.1%) than for spontaneous
data (4.39%).
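To make the comparison concrete, the following is a minimal sketch of
decision-level (late) fusion, in which hypothetical unimodal classifiers for
speech, face, and gesture each output a probability distribution over the
same emotion classes and the fusion step combines them by a weighted average;
the class names, probabilities, and weights are illustrative assumptions and
do not reproduce any of the systems surveyed above.

    import numpy as np

    # Hypothetical example of decision-level (late) fusion: each modality
    # contributes a probability distribution over the same emotion classes,
    # and the distributions are combined by a weighted average.
    CLASSES = ["joy", "anger", "sadness", "neutral"]

    def late_fusion(unimodal_probs, weights=None):
        """Fuse per-modality class probabilities by a (weighted) average."""
        probs = np.asarray(unimodal_probs, dtype=float)   # (n_modalities, n_classes)
        if weights is None:
            weights = np.ones(len(probs)) / len(probs)    # equal weights by default
        fused = np.average(probs, axis=0, weights=weights)
        return fused / fused.sum()                        # renormalize

    # Illustrative outputs: speech is confident, face and gestures are less
    # expressive, mirroring the imbalance often observed in spontaneous data.
    speech  = [0.10, 0.70, 0.10, 0.10]
    face    = [0.25, 0.30, 0.20, 0.25]
    gesture = [0.25, 0.25, 0.25, 0.25]

    fused = late_fusion([speech, face, gesture])
    print(CLASSES[int(np.argmax(fused))], fused.round(2))

Whether such a fusion step actually improves on the best unimodal classifier
depends on how much complementary information the weaker channels carry,
which is precisely what the acted versus spontaneous comparison above
suggests is often missing in spontaneous data.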
One explanation might be that experienced actors are usually
able to express emotions consistently across various channels, while
natural speakers do not have this capacity. For example, Wagner et
al. (2011a) found that the natural speakers they recorded for the CALLAS
corpus were more expressive in their speech than in their face or
gestures, probably because the elicitation method they used mainly
affected vocal expression. As a consequence, they did not obtain a
substantial gain from multimodal fusion compared to
the unimodal speech-based emotion classifier. At least, they were able