the most relevant features and train a classifier with the resulting
feature set. Hence, fusion is based on the integration of low-level
features at the feature level (see Figure 1b) and takes place at a rather
early stage of the recognition process.
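For illustration, a minimal sketch of feature-level fusion is given below, assuming two modalities (audio and video) and a single SVM classifier; the placeholder feature extractors and their inputs are hypothetical and not taken from any of the systems discussed here.

```python
# Sketch of feature-level fusion: low-level features of both modalities are
# concatenated into one joint vector, and a single classifier is trained on it.
import numpy as np
from sklearn.svm import SVC

def extract_audio_features(audio_clip):
    # Placeholder: in practice, e.g. prosodic or spectral statistics per clip.
    return np.asarray(audio_clip, dtype=float)

def extract_video_features(video_clip):
    # Placeholder: in practice, e.g. facial-expression descriptors per clip.
    return np.asarray(video_clip, dtype=float)

def fuse_features(audio_clip, video_clip):
    # Feature-level fusion: join the low-level feature vectors of both
    # modalities into a single feature vector.
    return np.concatenate([extract_audio_features(audio_clip),
                           extract_video_features(video_clip)])

def train_feature_level(audio_clips, video_clips, labels):
    # One classifier is trained on the joint feature space.
    X = np.array([fuse_features(a, v) for a, v in zip(audio_clips, video_clips)])
    clf = SVC(probability=True)
    clf.fit(X, labels)
    return clf
```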
An alternative would be to fuse the recognition results at the
decision level based on the outputs of separate unimodal classifiers
(see Figure 1c). Here, separate unimodal classifiers are trained for
each modality individually, and the resulting decisions are fused
using specific weighting rules. In the case of emotion recognition, the
input to the fusion algorithm may consist either of discrete emotion
categories, such as anger or joy, or of continuous values of a dimensional
emotion model (e.g., continuous representations of valence or
arousal). Hence, fusion is based on the integration of
high-level concepts and takes place at a later stage of the recognition
process.
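By way of contrast, the following minimal sketch illustrates decision-level fusion; the per-modality SVMs, the fixed modality weights, and the weighted sum of class posteriors are illustrative assumptions rather than a specific published weighting rule.

```python
# Sketch of decision-level fusion: one classifier per modality, and a simple
# weighted combination of their class posteriors as the fusion rule.
import numpy as np
from sklearn.svm import SVC

def train_unimodal(X, labels):
    # Train one classifier on the features of a single modality.
    clf = SVC(probability=True)
    clf.fit(X, labels)
    return clf

def fuse_decisions(audio_clf, video_clf, x_audio, x_video,
                   w_audio=0.6, w_video=0.4):
    # Both classifiers are assumed to be trained on the same label set,
    # so their classes_ arrays align. The weights are illustrative.
    p_audio = audio_clf.predict_proba([x_audio])[0]
    p_video = video_clf.predict_proba([x_video])[0]
    fused = w_audio * p_audio + w_video * p_video
    return audio_clf.classes_[np.argmax(fused)]
```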
Eyben et al. (2011) propose a mechanism that fuses audiovisual
social behaviors at an intermediate level based on the consideration
that behavioral events, such as smiles, head shakes and laughter,
convey important information about a person's emotional state that might
be lost if information is fused at the level of low-level features or at
the level of emotional states.
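The event-based idea can be sketched roughly as follows; the chosen events, the count-and-duration encoding, and the final SVM are illustrative assumptions and do not reproduce the design of Eyben et al.

```python
# Rough sketch of intermediate-level (event-based) fusion: behavioral events
# detected in the individual modalities become mid-level features for a
# final classifier that maps them to emotional states.
import numpy as np
from sklearn.svm import SVC

EVENTS = ["smile", "head_shake", "laughter"]

def event_vector(detected_events):
    # detected_events: list of (event_name, duration_in_seconds) tuples
    # produced by unimodal or audiovisual event detectors.
    vec = np.zeros(2 * len(EVENTS))
    for name, duration in detected_events:
        i = EVENTS.index(name)
        vec[i] += 1                        # event count
        vec[len(EVENTS) + i] += duration   # accumulated duration
    return vec

def train_event_level(event_lists, labels):
    # The mid-level event representation is fed to a single classifier.
    X = np.array([event_vector(ev) for ev in event_lists])
    clf = SVC()
    clf.fit(X, labels)
    return clf
```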
Which level of modality integration yields the best results is
usually hard to predict. Busso et al. (2004) report on an emotion-
specific comparison of feature-level and decision-level fusion that
was conducted for an audiovisual database containing four emotions
(sadness, anger, happiness, and a neutral state) deliberately posed by an
actress. They observed for their corpus that feature-level fusion was
most suitable for differentiating anger and the neutral state, while
decision-level fusion performed better for happiness and sadness. Caridakis et al.
(2007) presented a multimodal approach for the recognition of eight
emotions that integrated information from facial expressions, body
gestures, and speech. They observed a recognition improvement of
more than 10% over the most successful unimodal system
and found feature-level fusion to be superior to decision-level fusion.
Wagner et al. (2011a) tested a comprehensive repertoire of state-of-
the-art fusion techniques, including their own emotion-specific fusion
scheme, on the acted DaFEx corpus and the more natural CALLAS
corpus. Results were either considerably improved (DaFEx) or at least
in line with the dominating modality (CALLAS). Unlike Caridakis and
colleagues, Wagner and colleagues found that decision-level fusion
yielded more promising results than feature-level fusion.