or linguistically motivated time intervals, such as sentences. For a corpus consisting of speech and biosignals, Kim et al. (2005) suggested placing the border between two segments midway between two spoken utterances. Lingenfelser et al. (2011) used the time interval covered by a spoken utterance for all considered modalities, i.e. audio and video. These strategies suffer from two major problems. First, significant cues for emotion recognition from different modalities are not guaranteed to emerge within exactly the same time interval. Second, such cues may occur within a period shorter than a sentence. Classification accuracy could be expected to improve if modalities were segmented individually and the succession of, and delays between, emotional cues in different signals were investigated more closely. A promising step in this direction is the event-based fusion mechanism developed for the Callas Emotional Tree (Gilroy et al., 2011). Rather than computing global statistics in a segmentation-based manner, the approach aims to identify changes in the modality-specific expression of an emotion and is thus able to respond continuously to users' emotions while they interact with the system.
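The mechanics of such a scheme can be sketched as follows. This is a minimal illustration rather than the actual Callas implementation: the dimensional (valence/arousal) representation, the exponential decay, and all names are assumptions made for the example. Each modality-specific detector emits time-stamped emotion events, and the fused estimate is a confidence-weighted average in which an event's influence decays with its age, so the output tracks the user continuously instead of waiting for a segment boundary.

```python
import time
from dataclasses import dataclass

@dataclass
class EmotionEvent:
    """An emotion estimate emitted by one modality-specific detector."""
    modality: str       # e.g. "audio" or "video"
    valence: float      # illustrative dimensional value in [-1, 1]
    arousal: float      # illustrative dimensional value in [-1, 1]
    confidence: float   # detector confidence in [0, 1]
    timestamp: float    # seconds, e.g. from time.time()

class EventBasedFusion:
    """Fuses asynchronous emotion events: each event contributes to the
    estimate with a weight that decays with its age, so the output can
    respond continuously instead of waiting for a segment to end."""

    def __init__(self, half_life=2.0):
        self.half_life = half_life  # seconds until an event's weight halves
        self.events = []

    def push(self, event):
        self.events.append(event)

    def estimate(self, now=None):
        """Return the current (valence, arousal) estimate, or None if no
        modality has delivered any evidence yet."""
        now = time.time() if now is None else now
        num_v = num_a = denom = 0.0
        for e in self.events:
            weight = e.confidence * 0.5 ** ((now - e.timestamp) / self.half_life)
            num_v += weight * e.valence
            num_a += weight * e.arousal
            denom += weight
        return None if denom == 0.0 else (num_v / denom, num_a / denom)

# Events from different modalities may arrive at different times.
fusion = EventBasedFusion(half_life=2.0)
t0 = time.time()
fusion.push(EmotionEvent("audio", -0.4, 0.6, confidence=0.8, timestamp=t0))
fusion.push(EmotionEvent("video", -0.2, 0.3, confidence=0.5, timestamp=t0 + 1.0))
print(fusion.estimate(now=t0 + 1.5))  # the fresher video event weighs relatively more
```

Because every event carries its own timestamp, the delays between emotional cues in different signals discussed above are preserved in the event log and can be inspected directly, rather than being averaged away within a fixed segment.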
4.5 Dealing with imperfect data in the fusion process
Most algorithms for social signal fusion start from the assumption that all data from the different modalities are available at all times. As long as a system is used offline, this condition can easily be met by analyzing the data beforehand and omitting parts where input from one modality is corrupted or completely missing. In online mode, however, a manual pre-selection of data is not possible, and adequate ways of handling missing information must be found. Generally,
various reasons for missing information can be identified. First of
all, it is unrealistic to assume that a person continuously provides
meaningful data for each modality. Second, there may be technical
issues, such as noisy data due to adverse environmental conditions or missing data due to the failure of a sensor. As a consequence, a system needs to be able to decide dynamically which channels to exploit in the fusion process and to what extent the present signals can be trusted. For the case that data is partially missing, several treatments have been suggested in the literature, such as the removal of noise or the interpolation of missing data from available data. Wagner et al. (2011a) present a comprehensive study that successfully applies adaptations of state-of-the-art fusion techniques to the missing-data problem in multimodal emotion recognition.
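To make the simplest such treatment concrete, the following Python sketch performs weighted decision-level fusion in which channels that delivered no usable data are skipped and the weights of the remaining channels are renormalized. It is an illustration under assumed names, weights, and class scores, not the adaptations studied by Wagner et al.

```python
import numpy as np

def fuse_decisions(scores, weights):
    """Weighted decision-level fusion that tolerates missing channels:
    modalities that delivered no usable data are passed as None and are
    excluded, and the remaining channel weights are renormalized.

    scores  -- dict: modality name -> score vector over emotion classes,
               or None if the channel dropped out
    weights -- dict: modality name -> trust weight, e.g. derived from
               each channel's accuracy on a validation set
    """
    available = {m: s for m, s in scores.items() if s is not None}
    if not available:
        return None  # every channel failed; no decision possible
    total = sum(weights[m] for m in available)
    fused = sum((weights[m] / total) * np.asarray(s)
                for m, s in available.items())
    return int(np.argmax(fused))  # index of the winning emotion class

# Example: the video channel dropped out (sensor failure, face lost).
scores = {
    "audio":  [0.2, 0.7, 0.1],  # hypothetical scores for three classes
    "video":  None,
    "physio": [0.3, 0.5, 0.2],
}
weights = {"audio": 0.5, "video": 0.3, "physio": 0.2}
print(fuse_decisions(scores, weights))  # -> 1
```

The fixed trust weights used here could be replaced by per-instance confidence values, which would also address the second requirement above: deciding at run time how far each present signal can be trusted.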