input was used to estimate the probabilities of dialogue acts, which
were represented by weights in the finite-state machines.
Another approach that fuses emotional states with natural language
dialogue acts was presented by Crook et al. (2012), who integrated
a system for recognizing emotions from speech, developed by Vogt et al.
(2008), into a natural language dialogue system in order to improve the
robustness of the speech recognizer. Their system fuses emotional states
recognized from the acoustics of speech with sentiments extracted
from the transcript of speech. For example, when users express their
emotional state with words that are not included in the dictionary,
the system is still able to recognize their emotions from the acoustics
of speech.
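The fusion strategy described above can be illustrated with a small sketch. All names here (the toy lexicon, the `fuse` function, the fixed weighting) are illustrative assumptions, not the actual implementation by Crook et al.; the point is only the fallback behavior: when no word in the utterance is covered by the sentiment dictionary, the acoustic estimate is used on its own.

```python
# Hypothetical sketch of late fusion of an acoustic emotion distribution
# with a lexicon-based sentiment distribution over the transcript.

LEXICON = {"great": "positive", "awful": "negative"}  # toy sentiment dictionary

def sentiment_from_transcript(words):
    """Return an emotion distribution from lexicon hits, or None if no
    word in the utterance is covered by the dictionary."""
    hits = [LEXICON[w] for w in words if w in LEXICON]
    if not hits:
        return None
    dist = {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    for label in hits:
        dist[label] += 1.0 / len(hits)
    return dist

def fuse(acoustic_dist, words, w_acoustic=0.5):
    """Weighted late fusion; fall back to acoustics alone when the
    transcript contains no lexicon words (the robustness case above)."""
    text_dist = sentiment_from_transcript(words)
    if text_dist is None:
        return acoustic_dist
    return {k: w_acoustic * acoustic_dist[k]
               + (1 - w_acoustic) * text_dist.get(k, 0.0)
            for k in acoustic_dist}

acoustic = {"positive": 0.7, "negative": 0.2, "neutral": 0.1}
print(fuse(acoustic, ["that", "was", "great"]))
print(fuse(acoustic, ["hmm", "okay"]))  # no lexicon hit: acoustics only
```

A real system would of course learn the modality weights and use a far richer lexicon, but the complementary use of the two channels is the same: each modality can compensate when the other fails.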
6. Conclusion and Future Work
In this chapter, we discussed approaches to fuse semantic information
in dialogue systems as well as approaches to fuse social and emotional
cues. While the fusion of semantic information has been strongly
influenced by research done in the natural language community,
the fusion of social signals has heavily relied on techniques from
the multimedia community. Recently, the use of virtual agents and
robots in dialogue systems has led to stronger interactions between
the two areas of research. The use of social signal processing in
dialogue systems may not only improve the quality of interaction,
but also increase their robustness. Conversely, research in social
signal processing may profit from techniques developed for semantic
fusion. Most systems that integrate mechanisms for the multimodal
fusion of social signals in human-agent dialogue only consider a
supplementary use of multiple signals. That is, the system responds to
each cue individually, but does not attempt to resolve ambiguities by
considering additional modalities. One difficulty lies in the fact that
data have to be integrated in an incremental fashion while mechanisms
for social signal fusion usually start from global statistics over longer
segments. A promising avenue for future research might be to investigate
to what extent techniques from semantic fusion could be adopted to
exploit the complementary use of social signals. Among other things,
this implies a departure from the fusion of low-level features in favor
of higher-level social cues, such as head nods or laughter.
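The contrast drawn above, between global statistics over long segments and incremental integration, can be made concrete with a short sketch. The class name, the cue labels, and the confidence-weighted combination rule are all assumptions chosen for illustration; the idea is simply that higher-level cues arrive as timestamped events and the fused estimate is updated as each one comes in, rather than being computed once over a whole segment.

```python
# Illustrative sketch of incremental fusion over discrete social cues
# (head nods, laughter, ...) using a sliding time window.
from collections import deque

class IncrementalCueFusion:
    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self.events = deque()  # (timestamp, cue, confidence)

    def observe(self, t, cue, confidence):
        """Add one recognized cue event and return the updated estimate."""
        self.events.append((t, cue, confidence))
        # drop cues that have fallen out of the sliding window
        while self.events and t - self.events[0][0] > self.window_s:
            self.events.popleft()
        return self.estimate()

    def estimate(self):
        """Current engagement score: a confidence-weighted sum of cues
        in the window (a toy combination rule)."""
        weights = {"head_nod": 1.0, "laughter": 1.5, "gaze_away": -1.0}
        return sum(c * weights.get(cue, 0.0) for _, cue, c in self.events)

f = IncrementalCueFusion(window_s=5.0)
print(f.observe(0.0, "head_nod", 0.9))
print(f.observe(2.0, "laughter", 0.8))
print(f.observe(8.0, "gaze_away", 0.7))  # earlier cues have expired
```

Because the estimate is available after every event, a dialogue system could react mid-utterance, which is precisely what global segment statistics do not allow.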
Acknowledgement
The work described in this chapter is partially funded by the EU under
research grants CEEDS (FP7-ICT2009-5), TARDIS (FP7-ICT-2011-7) and
ILHAIRE (FP7-ICT-2009-C).