(resulting from the fusion of several modalities), confidence scores,
timestamps, as well as incompatible interpretations (“one-of”). Johnston
(2009) presents a variety of multimodal interfaces combining speech-,
touch- and pen-based input that have been developed using the EMMA
standard.
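
To make this concrete, the following is a minimal sketch (in Python, for illustration) of an EMMA-style annotation carrying two mutually exclusive interpretations with confidence scores and timestamps. The element and attribute names follow the W3C EMMA 1.0 Recommendation; the application payload (command, object) and the selection logic are assumptions made up for this sketch.

import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Two mutually exclusive readings of one utterance ("one-of"), each with a
# confidence score; start/end are absolute timestamps in milliseconds.
emma_doc = f"""
<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice"
               emma:start="1241035886246" emma:end="1241035888306">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <command>move</command><object>this</object>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68">
      <command>remove</command><object>this</object>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
"""

root = ET.fromstring(emma_doc)
# Pick the highest-confidence reading among the mutually exclusive ones.
best = max(root.iter(f"{{{EMMA_NS}}}interpretation"),
           key=lambda i: float(i.get(f"{{{EMMA_NS}}}confidence")))
print(best.get("id"))  # int1
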
3.3 Choice of segments to be considered in the fusion process
Most systems start from the assumption that the complete input
provided by the user can be integrated. Furthermore, they presume
that the start and end points of input in each modality are given, for
example, by requiring the user to explicitly mark it in the interaction.
Under such conditions, the determination of processing units to be
considered in the fusion process is rather straightforward. Typically,
temporal constraints are considered to find the best candidates to
be fused with each other. For example, a pointing gesture should
occur at approximately the same time as the corresponding natural
language expression, although it is not necessary that the two modalities
temporally overlap.
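
As an illustration of such temporal pairing, here is a minimal sketch in Python. The Segment fields, the helper names, and the 1.5-second proximity window are illustrative assumptions, not values from the literature.

from dataclasses import dataclass

@dataclass
class Segment:
    modality: str   # e.g. "speech" or "gesture"
    content: str    # recognized words or gesture label
    start: float    # seconds
    end: float

def temporal_distance(a, b):
    # 0.0 if the segments overlap, otherwise the size of the gap between them.
    return max(0.0, max(a.start, b.start) - min(a.end, b.end))

def pair_candidates(speech_segments, gesture_segments, max_gap=1.5):
    # Pair each speech segment with the temporally closest gesture; the two
    # must be close in time but need not overlap.
    pairs = []
    for s in speech_segments:
        close = [g for g in gesture_segments if temporal_distance(s, g) <= max_gap]
        if close:
            pairs.append((s, min(close, key=lambda g: temporal_distance(s, g))))
    return pairs

speech = [Segment("speech", "delete this one", 2.0, 3.1)]
gestures = [Segment("gesture", "point@obj42", 3.5, 3.7)]  # close, no overlap
print(pair_candidates(speech, gestures))
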
However, there are cases in which such an assumption is problematic
and may prevent a system from deriving a semantic interpretation.
First, an input component may produce an erroneous recognition result
that cannot be integrated. Second, the user may unintentionally provide
input, for example, by making a hand movement that was not intended
as a gesture.
In natural environments where users interact freely, the situation
becomes even harder: users constantly move their arms, but not every
movement is meant to be part of a system command. If eye gaze is
employed as a means to indicate a referent, the determination of
segments becomes even more challenging. Users tend to visually fixate
the objects they refer to; however, not every fixation is meant to
contribute to a referring expression.
A first approach to this problem was presented by Sun et al. (2009),
who propose a multimodal input fusion approach that flexibly skips
spare information in multimodal inputs that cannot be integrated.
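
The following is a rough sketch of such skipping during integration, loosely in the spirit of the approach just cited. The can_integrate test and the item structure are hypothetical stand-ins for a real unification-based compatibility check.

def can_integrate(partial, item):
    # Hypothetical compatibility test: a command takes at most one referent.
    return not ("referent" in partial and item.get("referent"))

def fuse_with_skipping(items):
    # Greedy left-to-right integration; non-integrable items are skipped
    # rather than causing the whole fusion attempt to fail.
    partial, skipped = {}, []
    for item in items:
        if can_integrate(partial, item):
            partial.update({k: v for k, v in item.items() if v is not None})
        else:
            skipped.append(item)   # spare/spurious input, ignored
    return partial, skipped

items = [
    {"command": "move", "referent": None},
    {"referent": "obj7"},       # deliberate pointing gesture
    {"referent": "obj12"},      # stray fixation: skipped
]
print(fuse_with_skipping(items))
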
3.4 Dealing with imperfect data in the fusion process
Multimodal interfaces often have to deal with uncertain data. Individual
signals may be noisy or hard to interpret, and some modalities may
be more problematic than others. A fusion mechanism should take these
uncertainties into account when integrating the modalities into a common
semantic representation.
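
As a minimal sketch of confidence-aware integration: assuming each modality delivers a scored distribution over candidate interpretations, one common (naive-Bayes-style) choice is to multiply the per-candidate scores and renormalize. All names and numbers below are invented for illustration.

def fuse_scores(*modality_scores):
    # Multiply per-modality confidences per candidate and renormalize.
    candidates = set().union(*modality_scores)
    joint = {c: 1.0 for c in candidates}
    for scores in modality_scores:
        for c in candidates:
            joint[c] *= scores.get(c, 1e-6)  # small floor for missing mass
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

speech = {"obj7": 0.6, "obj12": 0.4}   # noisy speech-based reference
gaze   = {"obj7": 0.8, "obj12": 0.2}   # fixation-based evidence
print(fuse_scores(speech, gaze))       # obj7 clearly dominates
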