is used, for instance, by Johnston et al. (2011) in the MTalk system, a
multimodal browser for location-based services.
3.6 Semantic fusion in human-agent interaction
A number of multimodal dialogue systems make use of a virtual agent
in order to allow for more natural interaction. Typically, these systems
employ graphical displays to which a user may refer using touch
or mouse gestures in combination with spoken or written natural
language input; for example, Martin et al. (2006), Wahlster (2003)
or Hofs et al. (2010). Furthermore, the use of freehand arm gestures
(Sowa et al., 2001) and eye gaze (Sun et al., 2008) to refer to objects
in a 3D environment has been explored in interactions with virtual
agents. Techniques for multimodal semantic fusion have also attracted
interest in the area of human-robot interaction. In most such systems, the
user's hands are tracked to determine objects or locations the user is
referring to via natural language; for example, Burger et al. (2011).
In addition to the recognition of hand gestures, Stiefelhagen et al.
(2004) make use of head tracking, based on the observation that users
typically look at the objects they refer to.
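To make this idea concrete, the following sketch ranks candidate referents by combining a tracked hand-pointing direction with head orientation. The function names, weights and geometry are illustrative assumptions and do not reproduce any of the cited systems.

```python
# Hedged illustration only: rank candidate objects by combining the tracked
# hand-pointing direction with head orientation (weights and names are assumptions).
import numpy as np

def angular_error(direction: np.ndarray, origin: np.ndarray, obj: np.ndarray) -> float:
    """Angle (radians) between a tracked direction and the ray from origin to obj."""
    to_obj = obj - origin
    cos = np.dot(direction, to_obj) / (np.linalg.norm(direction) * np.linalg.norm(to_obj))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def score_referent(obj_pos, hand_pos, hand_dir, head_pos, gaze_dir,
                   w_hand=0.6, w_head=0.4):
    """Lower is better; the hand is weighted more strongly than head orientation."""
    return (w_hand * angular_error(hand_dir, hand_pos, obj_pos)
            + w_head * angular_error(gaze_dir, head_pos, obj_pos))

# Two candidate objects; the user points and looks roughly toward the cup.
objects = {"cup": np.array([1.0, 0.0, 1.0]), "book": np.array([-1.0, 0.5, 1.0])}
hand_dir, gaze_dir = np.array([0.9, 0.1, 1.0]), np.array([1.0, 0.0, 0.9])
scores = {name: score_referent(pos, np.zeros(3), hand_dir,
                               np.array([0.0, 0.3, 0.0]), gaze_dir)
          for name, pos in objects.items()}
print(min(scores, key=scores.get))  # -> "cup"
```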
While some of the agent-based dialogue systems employ unification-
based grammars (Wahlster, 2003) or chart parsers (Sowa et al., 2001)
as presented in Section 3.1, others use a hybrid fusion mechanism
combining declarative formalisms, such as frames, with procedural
elements (Martin et al., 2006). Often the fusion of semantic information
is triggered by natural language components which detect a need to
integrate information from another modality (Stiefelhagen et al., 2004).
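As a hedged illustration of such trigger-based, frame-style fusion (the frame layout, time window and names below are assumptions rather than any cited system's design), the following sketch fills a deictic slot in a command frame with the gesture closest in time to the utterance:

```python
# Illustrative sketch only: frame-based fusion in which the language component
# triggers integration of a pointing gesture whenever it detects an unresolved
# deictic expression ("this", "that one", ...). All names are assumptions.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Gesture:
    target_id: str      # object identified by hand or head tracking
    timestamp: float    # seconds

@dataclass
class CommandFrame:
    action: str                      # e.g. "grasp", "move"
    object_id: Optional[str] = None  # to be filled by fusion if deictic
    deictic: bool = False            # True if the utterance contained "this"/"that"

def fuse(frame: CommandFrame, gestures: List[Gesture],
         speech_time: float, window: float = 1.5) -> CommandFrame:
    """Resolve a deictic slot with the gesture closest in time to the utterance."""
    if frame.deictic and frame.object_id is None:
        nearby = [g for g in gestures if abs(g.timestamp - speech_time) <= window]
        if nearby:
            frame.object_id = min(
                nearby, key=lambda g: abs(g.timestamp - speech_time)).target_id
    return frame

# "Grasp that one" spoken at t = 10.2 s; pointing gesture at the red block at t = 10.0 s.
result = fuse(CommandFrame(action="grasp", deictic=True),
              [Gesture("red_block", 10.0)], speech_time=10.2)
print(result.object_id)  # -> "red_block"
```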
In addition, attempts have been made to consider how multimodal
information is analyzed and produced by humans in the semantic fusion
process. Usually what is being said becomes not immediately clear, but
requires multiple turns between two interlocutors. Furthermore, people
typically analyze speech in an incremental manner while it is spoken
and provide feedback to the speaker before the utterance is completed.
For example, a listener may signal by a frown that an utterance
is not fully understood. To simulate such behavior in human-
agent interaction, a tight coupling of multimodal analysis, dialogue
processing and multimodal generation is required. Stiefelhagen et al.
(2007) propose to allow for clarification dialogues in order to improve
the accuracy of the fusion process in human-robot dialogue. Visser et
al. (2012) describe an incremental model of grounding that enables the
simulation of several grounding acts, such as initiate, acknowledge,
request and repair, in human-agent dialogue. If the virtual agent is
not able to come up with a meaning for the user's input, it generates a
request for repair, for example a frown or a clarification question.
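The following minimal sketch, with hypothetical names and thresholds not taken from Visser et al. (2012), illustrates how such grounding acts might be selected for each increment of the user's utterance:

```python
# Minimal sketch with hypothetical names: selecting a grounding act for each
# increment of the user's utterance, based on whether an interpretation exists
# and how confident the agent is in it. Not Visser et al.'s implementation.
from enum import Enum, auto
from typing import Optional

class GroundingAct(Enum):
    INITIATE = auto()     # present new material
    ACKNOWLEDGE = auto()  # signal understanding (nod, "uh-huh")
    REQUEST = auto()      # ask for further information
    REPAIR = auto()       # signal a problem, e.g. a frown or clarification question

def select_act(interpretation: Optional[dict], confidence: float) -> GroundingAct:
    """Choose listener feedback for the latest input increment (thresholds assumed)."""
    if interpretation is None:
        return GroundingAct.REPAIR       # no meaning found -> signal trouble
    if confidence < 0.5:
        return GroundingAct.REQUEST      # meaning uncertain -> ask for more
    return GroundingAct.ACKNOWLEDGE      # meaning clear enough -> backchannel

# Example: the parser could not interpret the latest speech increment.
print(select_act(None, 0.0))  # GroundingAct.REPAIR
```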