7. Multi-modal Assessment and Multi-modal Fusion for Emotion and Disposition Recognition
Recognizing the users' emotional and dispositional states can be achieved
by analyzing different modalities, e.g. analyzing facial expressions, body
postures, and gestures, or detecting and interpreting paralinguistic
information hidden in speech (see section on measurements above). In
addition to these types of external signals, psychobiological channels
can provide information about the user's current emotional state
(honest signals sensu Pentland and Pentland, 2008). Although emotion
recognition is often performed on single modalities, particularly in
benchmark studies on acted emotional data, the recognition of more
naturalistic emotions requires considering multi-modal events or states,
and principles of multi-modal pattern recognition are becoming
increasingly popular (Caridakis et al., 2007; Walter et al., 2011).
Basically, any multi-modal classification problem can be treated
as a uni-modal one by extracting relevant data or feature vectors
from each modality and concatenating them into a single vector, which
is then used as the input to a single monolithic classifier. This
fusion scheme is called data fusion, early fusion, or low-level fusion.
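As a rough illustration of this early-fusion scheme, the following sketch concatenates feature vectors from three hypothetical modalities and trains a single classifier on the result; the feature dimensions, the toy data, and the choice of an SVM are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_samples = 200

# Hypothetical per-modality feature matrices (dimensions are illustrative).
X_face = rng.normal(size=(n_samples, 30))   # e.g. facial-expression features
X_audio = rng.normal(size=(n_samples, 20))  # e.g. paralinguistic speech features
X_bio = rng.normal(size=(n_samples, 10))    # e.g. psychobiological features
y = rng.integers(0, 2, size=n_samples)      # binary emotion label (toy data)

# Early / data / low-level fusion: concatenate the modality features into a
# single vector and train one monolithic classifier on it.
X_fused = np.hstack([X_face, X_audio, X_bio])
clf = SVC().fit(X_fused, y)
print(clf.predict(X_fused[:5]))
```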
The opposite of data fusion is decision fusion, late fusion, or high-level
fusion. Here, the information from the different modalities is processed
separately until the individual classifier decisions have been computed;
an aggregation rule is then applied that combines these decisions
into a final overall decision. All of these notions reflect
the processing level (data/decision, early/late, or low/high) at which
information fusion takes place. In addition to these two principles,
feature-level fusion, also called intermediate-level or mid-level fusion, is
a common fusion scheme. This notion expresses the fact that the
information sources are fused after some type of higher-level
discriminative features has been computed, e.g. action-unit intensities,
statistics of spoken words, or speech content (Schwenker et al., 2006).
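The contrast with early fusion can be sketched in a few lines: in late fusion each modality gets its own classifier, and a simple aggregation rule, here an average of the class probabilities, combines the individual decisions into the final overall decision. The per-modality classifiers, the toy data, and the averaging rule are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_samples = 200
modalities = {
    "face": rng.normal(size=(n_samples, 30)),
    "audio": rng.normal(size=(n_samples, 20)),
    "bio": rng.normal(size=(n_samples, 10)),
}
y = rng.integers(0, 2, size=n_samples)

# Late / decision / high-level fusion: each modality is processed separately
# up to its own classifier decision ...
classifiers = {name: LogisticRegression(max_iter=1000).fit(X, y)
               for name, X in modalities.items()}

# ... and an aggregation rule (here: averaging the class probabilities)
# combines the individual decisions into the final overall decision.
avg_probs = np.mean(
    [clf.predict_proba(modalities[name]) for name, clf in classifiers.items()],
    axis=0,
)
final_decision = avg_probs.argmax(axis=1)
print(final_decision[:5])
```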
Besides these spatial types of fusion across different modalities, multi-
modal data streams also require the integration of temporal information
(Dietrich et al., 2003). In human-computer interaction scenarios, typical
events in the environment and the user's states or actions cannot be
detected or classified on the basis of single video frames or short-time
speech analysis windows (Glodek et al., 2011b). Usually such events
or states are represented as multi-variate time series, and thus
fusion in these applications almost always means both spatial and
temporal information fusion. The simplest temporal fusion scheme
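As a minimal sketch of temporal decision fusion over such a multi-variate time series, frame-level class probabilities (assumed to come from some per-frame classifier) can be averaged over a fixed window before a decision is taken; the window length and the averaging rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_frames, n_classes = 120, 3

# Hypothetical frame-level class probabilities, e.g. produced per video frame
# or per short-time speech analysis window by some frame-level classifier.
frame_probs = rng.dirichlet(np.ones(n_classes), size=n_frames)

window = 25  # number of consecutive frames fused into one decision (assumption)
decisions = []
for start in range(0, n_frames - window + 1, window):
    segment = frame_probs[start:start + window]
    decisions.append(int(segment.mean(axis=0).argmax()))  # average, then decide

print(decisions)
```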