Usually, multimodal input systems combine several n-best hypotheses
produced by multiple modality-specific generators. This yields a number
of candidate fusions, each with a score computed as a weighted sum of
the recognition scores provided by the individual modalities. As a
result, a poorly ranked hypothesis may still contribute to the overall
semantic representation because it is compatible with hypotheses from
other modalities. Multimodality thus enables us to use the strength of
one modality to compensate for the weaknesses of others; errors in
speech recognition, for example, can be compensated for by gesture
recognition and vice versa. Oviatt (1999) reported that 12.5% of
pen/voice interactions in QuickSet could be successfully analyzed
thanks to multimodal disambiguation, while Kaiser et al. (2003)
obtained an even higher rate of 46.4% for speech and 3D gestures that
could be attributed to multimodal disambiguation.
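To make the weighted-sum fusion concrete, the following minimal Python sketch combines two n-best lists into scored multimodal hypotheses. The weights, the data layout and the compatibility test are illustrative assumptions, not details of QuickSet or any other specific system.

# Minimal sketch of late fusion over n-best lists (illustrative only; the
# weights and the compatibility test are assumptions, not a system's API).
from itertools import product

def fuse(speech_nbest, gesture_nbest, w_speech=0.6, w_gesture=0.4, compatible=None):
    """Combine two n-best lists into scored multimodal hypotheses.

    Each n-best list holds (interpretation, recognition_score) pairs.
    A combination is kept only if its parts are compatible, and its
    score is a weighted sum of the unimodal recognition scores.
    """
    fused = []
    for (s_interp, s_score), (g_interp, g_score) in product(speech_nbest, gesture_nbest):
        if compatible is None or compatible(s_interp, g_interp):
            score = w_speech * s_score + w_gesture * g_score
            fused.append(((s_interp, g_interp), score))
    # Best multimodal interpretation first
    return sorted(fused, key=lambda pair: pair[1], reverse=True)

# Example: the second-best speech hypothesis wins overall because it is
# the only one compatible with the top gesture hypothesis.
speech = [("delete object", 0.8), ("move object", 0.7)]
gesture = [("drag to table", 0.9)]
print(fuse(speech, gesture, compatible=lambda s, g: s.startswith("move")))

Because incompatible combinations are discarded, a hypothesis ranked low by one recognizer can still end up in the top fused interpretation once evidence from the other modality is added, which is exactly the disambiguation effect described above.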
3.5 Desktop vs. mobile environments
More recent work addresses the challenge of supporting a speech-based
multimodal interface on heterogeneous devices, including not only
desktop PCs but also mobile devices such as smartphones (Johnston,
2009).
In addition, there is a trend towards less traditional platforms,
such as in-car interfaces (Gruenstein et al., 2009) or home control
interfaces (Dimitriadis and Schroeter, 2011). Such environments pose
particular challenges for multimodal analysis because of the increased
noise level, the less controlled setting and multi-threaded conversations.
We also need to consider that users produce multimodal output
continuously, not only when interacting with a system. For example, a
gesture a user performs to greet another user should not be confused
with a gesture intended to control the system. To relieve users of the
burden of explicitly indicating when they wish to interact, a system
should be able to distinguish automatically between commands and
non-commands, as sketched below.
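As a rough illustration of such command/non-command gating, the sketch below treats a gesture as system-directed only when simple attention cues point to the system. The event fields and the heuristic itself are hypothetical and merely stand in for the richer classifiers used in practice.

# Illustrative heuristic for separating system-directed gestures from social
# ones; the input fields and the rules are hypothetical assumptions.
from dataclasses import dataclass

@dataclass
class GestureEvent:
    label: str             # e.g. "wave", "point"
    gaze_at_display: bool  # was the user looking at the system while gesturing?
    speech_addressee: str  # "system", "other_user", or "none"

def is_command(event: GestureEvent) -> bool:
    """Treat a gesture as a command only when attention cues point to the system."""
    if event.speech_addressee == "other_user":
        return False  # e.g. a wave that greets another user
    return event.gaze_at_display or event.speech_addressee == "system"

print(is_command(GestureEvent("wave", gaze_at_display=False, speech_addressee="other_user")))  # False
print(is_command(GestureEvent("point", gaze_at_display=True, speech_addressee="none")))        # True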
Particular challenges arise in situated environments because
information about the user's physical context is required to interpret
a multimodal utterance. For example, a robot has to know its own
location and orientation, as well as the locations of objects in its
physical environment, to execute commands such as "Move to the table".
In a mobile application, the GPS location of the device may be used to
constrain the search results for a natural language query. When a user
says "restaurants" without specifying an area on the map displayed on
the phone, the system interprets the utterance as a request for
restaurants in the user's immediate vicinity. Such an approach
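A minimal sketch of such GPS-based filtering is given below; the haversine distance, the result layout and the 1 km radius are assumptions chosen for illustration rather than details of any particular deployed system.

# Sketch of constraining query results with the device's GPS position;
# the data layout and the 1 km radius are illustrative assumptions.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def nearby(results, device_lat, device_lon, radius_km=1.0):
    """Keep only results within radius_km of the device, nearest first."""
    with_dist = [(haversine_km(device_lat, device_lon, r["lat"], r["lon"]), r)
                 for r in results]
    return [r for d, r in sorted(with_dist, key=lambda t: t[0]) if d <= radius_km]

restaurants = [{"name": "A", "lat": 52.5205, "lon": 13.4095},
               {"name": "B", "lat": 52.5600, "lon": 13.5000}]
print(nearby(restaurants, 52.5200, 13.4050))  # only "A" lies within 1 km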