Kaiser et al. (2003) applied unification over typed feature structures
to analyze multimodal input consisting of speech, 3D gestures and
head direction in augmented and virtual reality. Notably, the system went beyond gestures that merely refer to objects and also considered gestures describing how actions should be performed. For example, it was able to interpret multimodal rotation commands such as "Turn the table <rotation gesture> clockwise", where the gesture specified both the object to be manipulated and the direction of rotation.
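To make the unification step concrete, the following Python sketch fuses a partial speech interpretation with a partial gesture interpretation by unifying two feature structures. It is a minimal illustration of the general technique, not the typed formalism of Kaiser et al. (2003); the feature names and values are invented for this example.

```python
# Minimal sketch of unification over (untyped, dict-based) feature structures,
# illustrating how a spoken command and a gesture can be fused. This is not
# the Kaiser et al. (2003) implementation; names and structures are invented.

FAIL = object()  # sentinel for unification failure

def unify(fs1, fs2):
    """Recursively unify two feature structures represented as nested dicts.
    Atomic values unify only if equal; a missing feature unifies with anything."""
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)
        for key, value in fs2.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is FAIL:
                    return FAIL
                result[key] = merged
            else:
                result[key] = value
        return result
    return fs1 if fs1 == fs2 else FAIL

# Partial interpretation of "Turn the table ... clockwise" (speech):
speech = {"action": "rotate",
          "object": {"type": "table"},
          "direction": "clockwise"}

# Partial interpretation of the accompanying rotation gesture:
gesture = {"action": "rotate",
           "object": {"type": "table", "id": "table_01"},
           "direction": "clockwise"}

print(unify(speech, gesture))
# -> {'action': 'rotate', 'object': {'type': 'table', 'id': 'table_01'},
#     'direction': 'clockwise'}
```

Because the gesture supplies the object identity while the speech supplies the action, neither modality alone yields a complete command, but their unification does.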
Another popular approach that was inspired by work on natural
language analysis used finite-state machines consisting of n + 1
tapes which represent the n input modalities to be analyzed and
their combined meaning (Bangalore and Johnston, 2009). When
analyzing a multimodal utterance, lattices that correspond to possible interpretations of the individual input streams are created by writing symbols onto the corresponding tapes. The input streams are then aligned by transforming their lattices into a single lattice that represents the combined semantic interpretation. Temporal constraints are not explicitly encoded as in the unification-based approaches described above, but are implicitly given by the order of the symbols written on the individual tapes. Approaches to representing temporal constraints within
state chart mechanisms have been presented by Latoschik (2002) and
more recently by Mehlmann and André (2012).
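The Python sketch below illustrates the multi-tape idea on a toy scale: a three-tape machine (speech tape, gesture tape, output tape) consumes a spoken utterance and a stream of gesture symbols in lockstep and writes a combined interpretation onto the output tape. For simplicity it processes single symbol sequences rather than full lattices, and the tiny grammar is invented; it is not the transducer formalism of Bangalore and Johnston (2009).

```python
# Schematic sketch of a 3-tape finite-state machine (speech tape, gesture
# tape, meaning tape). EPS marks an epsilon move that leaves a tape untouched.

EPS = None

# Each transition: (source_state, speech_in, gesture_in, meaning_out, target_state)
TRANSITIONS = [
    (0, "move",  EPS,  "ACTION(move)",  1),
    (1, "that",  "G1", "OBJECT(G1)",    2),
    (2, "there", "G2", "LOCATION(G2)",  3),
]
FINAL_STATES = {3}

def parse(speech, gesture):
    """Consume both input tapes in a single left-to-right pass and write
    the combined semantic interpretation onto the output tape."""
    state, s, g, meaning = 0, list(speech), list(gesture), []
    while s or g:
        for (src, s_in, g_in, out, dst) in TRANSITIONS:
            if src != state:
                continue
            s_ok = s_in is EPS or (s and s[0] == s_in)
            g_ok = g_in is EPS or (g and g[0] == g_in)
            if s_ok and g_ok:
                if s_in is not EPS: s.pop(0)
                if g_in is not EPS: g.pop(0)
                meaning.append(out)
                state = dst
                break
        else:
            return None  # no applicable transition: the inputs do not align
    return meaning if state in FINAL_STATES else None

print(parse(["move", "that", "there"], ["G1", "G2"]))
# -> ['ACTION(move)', 'OBJECT(G1)', 'LOCATION(G2)']
```

Note how the relative order of symbols on the two input tapes is what aligns the two pointing gestures with the deictic words; no explicit timestamps are consulted.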
3.2 Semantic representation of fusion input
A fundamental problem of the very early systems was that there was no declarative formalism for formulating integration constraints. A noteworthy exception was the approach used in QuickSet, which clearly separated the statements of the multimodal grammar from the mechanisms of parsing (Johnston, 1998). This approach enabled
not only the declarative formulation of type constraints, such as “the
location of a flood zone should be an area”, but also the specification
of spatial and temporal constraints, such as “two regions should be a
limited distance apart” and “the time of speech must either overlap
with or start within four seconds of the time of the gesture”.
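As a rough illustration of how such a constraint can be stated declaratively and kept separate from the parsing machinery, the sketch below encodes the four-second temporal constraint as a small predicate over time intervals. The data structures and the reading of "within four seconds" (measured from the end of the gesture) are assumptions made for this example, not QuickSet's actual implementation.

```python
# Minimal sketch of a declarative integration constraint in the spirit of
# QuickSet (Johnston, 1998): the constraint is a small, self-contained
# predicate that any parser can apply. Times are in seconds; the names
# and structures are invented for illustration.

from dataclasses import dataclass

@dataclass
class Interval:
    start: float
    end: float

def overlaps(a: Interval, b: Interval) -> bool:
    return a.start <= b.end and b.start <= a.end

def temporally_compatible(speech: Interval, gesture: Interval,
                          max_lag: float = 4.0) -> bool:
    """Speech must overlap the gesture or start within `max_lag` seconds
    after the gesture ends."""
    return overlaps(speech, gesture) or 0.0 <= speech.start - gesture.end <= max_lag

# Example: gesture from t=1.0s to t=1.8s, speech starting 2.5s after it ends.
gesture = Interval(1.0, 1.8)
speech  = Interval(4.3, 5.1)
print(temporally_compatible(speech, gesture))   # True: lag of 2.5s <= 4s
```

Keeping the constraint as data-like code rather than burying it in the parser is what makes it easy to inspect, test, and replace.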
Many recent multimodal input systems, such as SmartKom
(Wahlster, 2003), make use of an XML language for representing
messages exchanged between software modules. An attempt to
standardize such a representation language has been made by the World
Wide Web Consortium (W3C) with EMMA (Extensible MultiModal
Annotation markup language). It enables the representation of
characteristic features of the fusion process, such as "composite" information that combines input from several modalities.
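A schematic example of what such a composite annotation might look like is given below, embedded as a string in Python and checked for well-formedness. The element and attribute names follow the W3C EMMA vocabulary as far as recalled here (emma:group, emma:interpretation, emma:medium, emma:mode, timing attributes), but the exact markup should be treated as an assumption; the normative schema is defined by the specification.

```python
# Illustrative (non-normative) sketch of an EMMA-style "composite" annotation,
# grouping interpretations from two modalities into one combined result.
import xml.etree.ElementTree as ET

composite = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:group>
    <emma:interpretation id="speech1" emma:medium="acoustic" emma:mode="voice"
                         emma:start="1000" emma:end="1900">
      <action>rotate</action><direction>clockwise</direction>
    </emma:interpretation>
    <emma:interpretation id="gesture1" emma:medium="visual" emma:mode="gesture"
                         emma:start="1200" emma:end="1600">
      <object>table_01</object>
    </emma:interpretation>
  </emma:group>
</emma:emma>
"""

root = ET.fromstring(composite.strip())
print(root.tag)  # -> "{http://www.w3.org/2003/04/emma}emma"
```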