2.2 Levels of integration
Two main fusion architectures have been proposed in the
literature, depending on the level at which sensor data are fused.
In the case of low-level fusion, the input from different sensors is
integrated at an early stage of processing. Low-level fusion is therefore
often also called early fusion. The fusion input may consist of either
raw data or low-level features, such as pitch. The advantage of low-level
fusion is that it enables a tight integration of modalities. There
is, however, no declarative representation of the relationship between
the various sensor data, which complicates the interpretation of recognition
results.
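As a rough illustration, low-level fusion can be as simple as concatenating feature vectors from two modalities before a single classifier sees them. The following minimal Python sketch assumes hypothetical pitch and hand-trajectory features; the names and dimensions are invented for illustration.

```python
import numpy as np

def early_fusion(audio_features: np.ndarray, gesture_features: np.ndarray) -> np.ndarray:
    """Fuse two modalities at the feature level by concatenation.

    Both inputs are assumed to be feature vectors extracted over the
    same time window; the fused vector would feed a single downstream
    classifier, which sees both modalities at once.
    """
    return np.concatenate([audio_features, gesture_features])

# Hypothetical feature vectors for one time window.
pitch = np.array([180.0, 12.5])          # e.g. mean pitch (Hz), pitch variance
trajectory = np.array([0.3, -0.1, 0.8])  # e.g. normalized hand displacement
fused = early_fusion(pitch, trajectory)  # shape (5,), classified as one unit
```

Because the classifier operates on the joint vector, cross-modal correlations can be exploited, but nothing in this representation states how the pitch and trajectory features relate to one another, which is exactly the interpretability drawback noted above.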
In the case of high-level fusion, low-level input has to pass through
modality-specific analyzers before it is integrated, e.g. by summing recognition
probabilities to derive a final decision. High-level fusion occurs at a
later stage of processing and is therefore often also called late fusion.
The advantage of high-level fusion is that it allows for the definition of
declarative rules to combine the interpreted results of various sensors.
There is, however, the danger that information is lost because the
abstraction process occurs too early.
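A minimal sketch of the probability-summing combination mentioned above might look as follows; the label set, weights, and recognizer outputs are invented for illustration.

```python
def late_fusion(speech_probs: dict[str, float],
                gesture_probs: dict[str, float],
                weights: tuple[float, float] = (0.5, 0.5)) -> str:
    """Combine per-modality recognition probabilities into one decision.

    Each modality-specific analyzer returns a distribution over the same
    label set; the fused score is a weighted sum, and the label with the
    highest combined score is selected.
    """
    labels = speech_probs.keys() & gesture_probs.keys()
    w_speech, w_gesture = weights
    scores = {label: w_speech * speech_probs[label] + w_gesture * gesture_probs[label]
              for label in labels}
    return max(scores, key=scores.get)

# Hypothetical recognizer outputs over a shared command vocabulary.
speech = {"select": 0.6, "move": 0.3, "delete": 0.1}
gesture = {"select": 0.2, "move": 0.7, "delete": 0.1}
print(late_fusion(speech, gesture))  # -> "move"
```

Note that each recognizer has already committed to a distribution over discrete labels before fusion takes place: any information in the raw signals that was discarded during per-modality analysis is no longer available to the combination step.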
3. Multimodal Interfaces Featuring Semantic Fusion
In this section, we focus on semantic fusion, which combines the meanings
of the individual modalities into a uniform representation.
3.1 Techniques for semantic fusion
Systems aiming at a semantic interpretation of multimodal input
typically use a late fusion approach at the decision level and process
each modality individually before fusion (see Figure 1a). Usually,
they rely on mechanisms originally introduced for the
analysis of natural language.
Johnston (1998) proposed an approach to modality integration
for the QuickSet system that was based on unification over typed
feature structures. The basic idea was to build up a common semantic
representation of the multimodal input by unifying feature structures
that represented the semantic contributions of the individual modalities.
For instance, the system was able to derive a partial interpretation for
a spoken natural language reference which indicated that the location
of the referent was of type “point”. In this case, only unification with
gestures of type “point” would succeed.
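Johnston's implementation operated over typed feature structures; the following simplified Python sketch uses plain nested dicts to stand in for feature structures, with invented feature names, to show how unification merges compatible contributions and fails on a type clash.

```python
def unify(fs1: dict, fs2: dict):
    """Unify two feature structures represented as nested dicts.

    Returns the merged structure, or None if the structures conflict
    (e.g. incompatible atomic values such as differing 'type' features).
    """
    result = dict(fs1)
    for key, val2 in fs2.items():
        if key not in result:
            result[key] = val2
        elif isinstance(result[key], dict) and isinstance(val2, dict):
            sub = unify(result[key], val2)
            if sub is None:
                return None          # conflict in a substructure
            result[key] = sub
        elif result[key] != val2:
            return None              # atomic values clash: unification fails
    return result

# Speech contributes a command whose location must be of type "point";
# the gesture contributes an actual point, so unification fills the gap.
speech_fs = {"command": "create_unit", "location": {"type": "point"}}
gesture_fs = {"location": {"type": "point", "coords": (31.2, 42.7)}}
print(unify(speech_fs, gesture_fs))
# {'command': 'create_unit', 'location': {'type': 'point', 'coords': (31.2, 42.7)}}

# A gesture of a different type fails to unify, as in the example above.
area_fs = {"location": {"type": "area", "coords": [(0, 0), (1, 1)]}}
print(unify(speech_fs, area_fs))  # -> None
```

The appeal of this scheme is that partiality is handled for free: each modality contributes only the features it can determine, and unification both checks their compatibility and assembles the combined interpretation.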