Usually, multimodal input systems combine several n-best hypotheses
produced by multiple modality-specific generators. This yields a number
of candidate fusions, each with a score computed as a weighted sum of
the recognition scores provided by the individual modalities. As a
result, a poorly ranked hypothesis may still contribute to the overall
semantic representation because it is compatible with hypotheses from
other modalities. Multimodality thus enables us to use the strength of
one modality to compensate for the weaknesses of others; errors in
speech recognition, for example, can be compensated for by gesture
recognition and vice versa. Oviatt (1999) reported that 12.5% of
pen/voice interactions in QuickSet could be successfully analyzed
thanks to multimodal disambiguation, while Kaiser et al. (2003)
obtained an even higher rate of 46.4% for speech and 3D gestures that
could be attributed to multimodal disambiguation.
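To make the weighted-sum fusion concrete, the following minimal Python sketch combines two n-best lists into scored multimodal hypotheses. The weights, the data layout and the compatibility test are illustrative assumptions, not details of QuickSet or any other specific system.

# Minimal sketch of late fusion over n-best lists (illustrative only; the
# weights and the compatibility test are assumptions, not a system's API).
from itertools import product

def fuse(speech_nbest, gesture_nbest, w_speech=0.6, w_gesture=0.4, compatible=None):
    """Combine two n-best lists into scored multimodal hypotheses.

    Each n-best list holds (interpretation, recognition_score) pairs.
    A combination is kept only if its parts are compatible, and its
    score is a weighted sum of the unimodal recognition scores.
    """
    fused = []
    for (s_interp, s_score), (g_interp, g_score) in product(speech_nbest, gesture_nbest):
        if compatible is None or compatible(s_interp, g_interp):
            score = w_speech * s_score + w_gesture * g_score
            fused.append(((s_interp, g_interp), score))
    # Best multimodal interpretation first
    return sorted(fused, key=lambda pair: pair[1], reverse=True)

# Example: the second-best speech hypothesis wins overall because it is
# the only one compatible with the top gesture hypothesis.
speech = [("delete object", 0.8), ("move object", 0.7)]
gesture = [("drag to table", 0.9)]
print(fuse(speech, gesture, compatible=lambda s, g: s.startswith("move")))

Because incompatible combinations are discarded, a hypothesis ranked low by one recognizer can still end up in the top fused interpretation once evidence from the other modality is added, which is exactly the disambiguation effect described above.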
3.5 Desktop vs. mobile environments
More recent work addresses the challenge of supporting a speech-based
multimodal interface on heterogeneous devices, including not only
desktop PCs but also mobile devices such as smartphones (Johnston,
2009).
In addition, there is a trend towards less traditional platforms,
such as in-car interfaces (Gruenstein et al., 2009) or home control
interfaces (Dimitriadis and Schroeter, 2011). Such environments pose
particular challenges for multimodal analysis because of the increased
noise level, the less controlled setting and multi-threaded conversations.
We also need to consider that users produce multimodal output
continuously, not only when interacting with a system. For example, a
gesture a user performs to greet another user should not be confused
with a gesture intended to control the system. To relieve users of the
burden of explicitly indicating when they wish to interact, a system
should be able to distinguish automatically between commands and
non-commands, as sketched below.
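As a rough illustration of such command/non-command gating, the sketch below treats a gesture as system-directed only when simple attention cues point to the system. The event fields and the heuristic itself are hypothetical and merely stand in for the richer classifiers used in practice.

# Illustrative heuristic for separating system-directed gestures from social
# ones; the input fields and the rules are hypothetical assumptions.
from dataclasses import dataclass

@dataclass
class GestureEvent:
    label: str             # e.g. "wave", "point"
    gaze_at_display: bool  # was the user looking at the system while gesturing?
    speech_addressee: str  # "system", "other_user", or "none"

def is_command(event: GestureEvent) -> bool:
    """Treat a gesture as a command only when attention cues point to the system."""
    if event.speech_addressee == "other_user":
        return False  # e.g. a wave that greets another user
    return event.gaze_at_display or event.speech_addressee == "system"

print(is_command(GestureEvent("wave", gaze_at_display=False, speech_addressee="other_user")))  # False
print(is_command(GestureEvent("point", gaze_at_display=True, speech_addressee="none")))        # True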
Particular challenges arise in situated environments because
information about the user's physical context is required to interpret
a multimodal utterance. For example, a robot has to know its own
location and orientation, as well as the locations of objects in its
physical environment, to execute commands such as "Move to the table".
In a mobile application, the GPS location of the device may be used to
constrain the search results for a natural language query. When a user
says "restaurants" without specifying an area on the map displayed on
the phone, the system interprets the utterance as a request for
restaurants in the user's immediate vicinity. Such an approach
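A minimal sketch of such GPS-based filtering is given below; the haversine distance, the result layout and the 1 km radius are assumptions chosen for illustration rather than details of any particular deployed system.

# Sketch of constraining query results with the device's GPS position;
# the data layout and the 1 km radius are illustrative assumptions.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def nearby(results, device_lat, device_lon, radius_km=1.0):
    """Keep only results within radius_km of the device, nearest first."""
    with_dist = [(haversine_km(device_lat, device_lon, r["lat"], r["lon"]), r)
                 for r in results]
    return [r for d, r in sorted(with_dist, key=lambda t: t[0]) if d <= radius_km]

restaurants = [{"name": "A", "lat": 52.5205, "lon": 13.4095},
               {"name": "B", "lat": 52.5600, "lon": 13.5000}]
print(nearby(restaurants, 52.5200, 13.4050))  # only "A" lies within 1 km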