Kaiser et al. (2003) applied unification over typed feature structures
to analyze multimodal input consisting of speech, 3D gestures and
head direction in augmented and virtual reality. Notably, the system went beyond gestures that merely refer to objects and also considered gestures describing how actions should be performed. For example, it was able to interpret multimodal rotation commands such as "Turn the table <rotation gesture> clockwise", where the gesture specified both the object to be manipulated and the direction of rotation.
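To make the unification step concrete, the following Python sketch fuses a partial speech interpretation with a partial gesture interpretation by unifying two feature structures. It is a minimal illustration of the general technique, not the typed formalism of Kaiser et al. (2003); the feature names and values are invented for this example.

```python
# Minimal sketch of unification over (untyped, dict-based) feature structures,
# illustrating how a spoken command and a gesture can be fused. This is not
# the Kaiser et al. (2003) implementation; names and structures are invented.

FAIL = object()  # sentinel for unification failure

def unify(fs1, fs2):
    """Recursively unify two feature structures represented as nested dicts.
    Atomic values unify only if equal; a missing feature unifies with anything."""
    if isinstance(fs1, dict) and isinstance(fs2, dict):
        result = dict(fs1)
        for key, value in fs2.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is FAIL:
                    return FAIL
                result[key] = merged
            else:
                result[key] = value
        return result
    return fs1 if fs1 == fs2 else FAIL

# Partial interpretation of "Turn the table ... clockwise" (speech):
speech = {"action": "rotate",
          "object": {"type": "table"},
          "direction": "clockwise"}

# Partial interpretation of the accompanying rotation gesture:
gesture = {"action": "rotate",
           "object": {"type": "table", "id": "table_01"},
           "direction": "clockwise"}

print(unify(speech, gesture))
# -> {'action': 'rotate', 'object': {'type': 'table', 'id': 'table_01'},
#     'direction': 'clockwise'}
```

Because the gesture supplies the object identity while the speech supplies the action, neither modality alone yields a complete command, but their unification does.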
Another popular approach that was inspired by work on natural
language analysis used finite-state machines consisting of n + 1
tapes which represent the n input modalities to be analyzed and
their combined meaning (Bangalore and Johnston, 2009). When
analyzing a multimodal utterance, lattices that correspond to possible interpretations of the individual input streams are created by writing symbols onto the corresponding tapes. The input streams are then aligned by transforming their lattices into a single lattice that represents the combined semantic interpretation. Temporal constraints are not explicitly encoded as in the unification-based approaches described above, but are implicitly given by the order of the symbols written on the individual tapes. Approaches to representing temporal constraints within
state chart mechanisms have been presented by Latoschik (2002) and
more recently by Mehlmann and André (2012).
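The Python sketch below illustrates the multi-tape idea on a toy scale: a three-tape machine (speech tape, gesture tape, output tape) consumes a spoken utterance and a stream of gesture symbols in lockstep and writes a combined interpretation onto the output tape. For simplicity it processes single symbol sequences rather than full lattices, and the tiny grammar is invented; it is not the transducer formalism of Bangalore and Johnston (2009).

```python
# Schematic sketch of a 3-tape finite-state machine (speech tape, gesture
# tape, meaning tape). EPS marks an epsilon move that leaves a tape untouched.

EPS = None

# Each transition: (source_state, speech_in, gesture_in, meaning_out, target_state)
TRANSITIONS = [
    (0, "move",  EPS,  "ACTION(move)",  1),
    (1, "that",  "G1", "OBJECT(G1)",    2),
    (2, "there", "G2", "LOCATION(G2)",  3),
]
FINAL_STATES = {3}

def parse(speech, gesture):
    """Consume both input tapes in a single left-to-right pass and write
    the combined semantic interpretation onto the output tape."""
    state, s, g, meaning = 0, list(speech), list(gesture), []
    while s or g:
        for (src, s_in, g_in, out, dst) in TRANSITIONS:
            if src != state:
                continue
            s_ok = s_in is EPS or (s and s[0] == s_in)
            g_ok = g_in is EPS or (g and g[0] == g_in)
            if s_ok and g_ok:
                if s_in is not EPS: s.pop(0)
                if g_in is not EPS: g.pop(0)
                meaning.append(out)
                state = dst
                break
        else:
            return None  # no applicable transition: the inputs do not align
    return meaning if state in FINAL_STATES else None

print(parse(["move", "that", "there"], ["G1", "G2"]))
# -> ['ACTION(move)', 'OBJECT(G1)', 'LOCATION(G2)']
```

Note how the relative order of symbols on the two input tapes is what aligns the two pointing gestures with the deictic words; no explicit timestamps are consulted.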
3.2 Semantic representation of fusion input
A fundamental problem of the very early systems was that there was no declarative formalism for formulating integration constraints. A noteworthy exception was the approach used in QuickSet, which clearly separated the statements of the multimodal grammar from the mechanisms of parsing (Johnston, 1998). This approach enabled
not only the declarative formulation of type constraints, such as “the
location of a flood zone should be an area”, but also the specification
of spatial and temporal constraints, such as “two regions should be a
limited distance apart” and “the time of speech must either overlap
with or start within four seconds of the time of the gesture”.
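As a rough illustration of how such a constraint can be stated declaratively and kept separate from the parsing machinery, the sketch below encodes the four-second temporal constraint as a small predicate over time intervals. The data structures and the reading of "within four seconds" (measured from the end of the gesture) are assumptions made for this example, not QuickSet's actual implementation.

```python
# Minimal sketch of a declarative integration constraint in the spirit of
# QuickSet (Johnston, 1998): the constraint is a small, self-contained
# predicate that any parser can apply. Times are in seconds; the names
# and structures are invented for illustration.

from dataclasses import dataclass

@dataclass
class Interval:
    start: float
    end: float

def overlaps(a: Interval, b: Interval) -> bool:
    return a.start <= b.end and b.start <= a.end

def temporally_compatible(speech: Interval, gesture: Interval,
                          max_lag: float = 4.0) -> bool:
    """Speech must overlap the gesture or start within `max_lag` seconds
    after the gesture ends."""
    return overlaps(speech, gesture) or 0.0 <= speech.start - gesture.end <= max_lag

# Example: gesture from t=1.0s to t=1.8s, speech starting 2.5s after it ends.
gesture = Interval(1.0, 1.8)
speech  = Interval(4.3, 5.1)
print(temporally_compatible(speech, gesture))   # True: lag of 2.5s <= 4s
```

Keeping the constraint as data-like code rather than burying it in the parser is what makes it easy to inspect, test, and replace.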
Many recent multimodal input systems, such as SmartKom
(Wahlster, 2003), make use of an XML language for representing
messages exchanged between software modules. An attempt to
standardize such a representation language has been made by the World
Wide Web Consortium (W3C) with EMMA (Extensible MultiModal
Annotation markup language). It enables the representation of
characteristic features of the fusion process, such as "composite" information that combines input from several modalities.
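A schematic example of what such a composite annotation might look like is given below, embedded as a string in Python and checked for well-formedness. The element and attribute names follow the W3C EMMA vocabulary as far as recalled here (emma:group, emma:interpretation, emma:medium, emma:mode, timing attributes), but the exact markup should be treated as an assumption; the normative schema is defined by the specification.

```python
# Illustrative (non-normative) sketch of an EMMA-style "composite" annotation,
# grouping interpretations from two modalities into one combined result.
import xml.etree.ElementTree as ET

composite = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:group>
    <emma:interpretation id="speech1" emma:medium="acoustic" emma:mode="voice"
                         emma:start="1000" emma:end="1900">
      <action>rotate</action><direction>clockwise</direction>
    </emma:interpretation>
    <emma:interpretation id="gesture1" emma:medium="visual" emma:mode="gesture"
                         emma:start="1200" emma:end="1600">
      <object>table_01</object>
    </emma:interpretation>
  </emma:group>
</emma:emma>
"""

root = ET.fromstring(composite.strip())
print(root.tag)  # -> "{http://www.w3.org/2003/04/emma}emma"
```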