2.2 Levels of integration
Two main fusion architectures have been proposed in the
literature, depending on the level at which sensor data are fused.
In the case of low-level fusion, the input from different sensors is
integrated at an early stage of processing. Low-level fusion is therefore
often also called early fusion. The fusion input may consist of either
raw data or low-level features, such as pitch. The advantage of low-level
fusion is that it enables a tight integration of modalities. There
is, however, no declarative representation of the relationship between
the various sensor data, which complicates the interpretation of recognition
results.
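As a rough illustration, low-level fusion can be as simple as concatenating feature vectors from two modalities before a single classifier sees them. The following minimal Python sketch assumes hypothetical pitch and hand-trajectory features; the names and dimensions are invented for illustration.

```python
import numpy as np

def early_fusion(audio_features: np.ndarray, gesture_features: np.ndarray) -> np.ndarray:
    """Fuse two modalities at the feature level by concatenation.

    Both inputs are assumed to be feature vectors extracted over the
    same time window; the fused vector would feed a single downstream
    classifier, which sees both modalities at once.
    """
    return np.concatenate([audio_features, gesture_features])

# Hypothetical feature vectors for one time window.
pitch = np.array([180.0, 12.5])          # e.g. mean pitch (Hz), pitch variance
trajectory = np.array([0.3, -0.1, 0.8])  # e.g. normalized hand displacement
fused = early_fusion(pitch, trajectory)  # shape (5,), classified as one unit
```

Because the classifier operates on the joint vector, cross-modal correlations can be exploited, but nothing in this representation states how the pitch and trajectory features relate to one another, which is exactly the interpretability drawback noted above.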
In the case of high-level fusion, low-level input has to pass through
modality-specific analyzers before it is integrated, e.g. by summing recognition
probabilities to derive a final decision. High-level fusion occurs at a
later stage of processing and is therefore often also called late fusion.
The advantage of high-level fusion is that it allows for the definition of
declarative rules to combine the interpreted results of various sensors.
There is, however, the danger that information is lost because the
abstraction process occurs too early.
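A minimal sketch of the probability-summing combination mentioned above might look as follows; the label set, weights, and recognizer outputs are invented for illustration.

```python
def late_fusion(speech_probs: dict[str, float],
                gesture_probs: dict[str, float],
                weights: tuple[float, float] = (0.5, 0.5)) -> str:
    """Combine per-modality recognition probabilities into one decision.

    Each modality-specific analyzer returns a distribution over the same
    label set; the fused score is a weighted sum, and the label with the
    highest combined score is selected.
    """
    labels = speech_probs.keys() & gesture_probs.keys()
    w_speech, w_gesture = weights
    scores = {label: w_speech * speech_probs[label] + w_gesture * gesture_probs[label]
              for label in labels}
    return max(scores, key=scores.get)

# Hypothetical recognizer outputs over a shared command vocabulary.
speech = {"select": 0.6, "move": 0.3, "delete": 0.1}
gesture = {"select": 0.2, "move": 0.7, "delete": 0.1}
print(late_fusion(speech, gesture))  # -> "move"
```

Note that each recognizer has already committed to a distribution over discrete labels before fusion takes place: any information in the raw signals that was discarded during per-modality analysis is no longer available to the combination step.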
3. Multimodal Interfaces Featuring Semantic Fusion
In this section, we focus on semantic fusion, which combines the meanings
of the individual modalities into a uniform representation.
3.1 Techniques for semantic fusion
Systems aiming at a semantic interpretation of multimodal input
typically use a late fusion approach at the decision level and process
each modality individually before fusion (see Figure 1a). Usually,
they rely on mechanisms originally introduced for the
analysis of natural language.
Johnston (1998) proposed an approach to modality integration
for the QuickSet system that was based on unification over typed
feature structures. The basic idea was to build up a common semantic
representation of the multimodal input by unifying feature structures
that represented the semantic contributions of the individual modalities.
For instance, the system was able to derive a partial interpretation for
a spoken natural language reference which indicated that the location
of the referent was of type “point”. In this case, only unification with
gestures of type “point” would succeed.
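Johnston's implementation operated over typed feature structures; the following simplified Python sketch uses plain nested dicts to stand in for feature structures, with invented feature names, to show how unification merges compatible contributions and fails on a type clash.

```python
def unify(fs1: dict, fs2: dict):
    """Unify two feature structures represented as nested dicts.

    Returns the merged structure, or None if the structures conflict
    (e.g. incompatible atomic values such as differing 'type' features).
    """
    result = dict(fs1)
    for key, val2 in fs2.items():
        if key not in result:
            result[key] = val2
        elif isinstance(result[key], dict) and isinstance(val2, dict):
            sub = unify(result[key], val2)
            if sub is None:
                return None          # conflict in a substructure
            result[key] = sub
        elif result[key] != val2:
            return None              # atomic values clash: unification fails
    return result

# Speech contributes a command whose location must be of type "point";
# the gesture contributes an actual point, so unification fills the gap.
speech_fs = {"command": "create_unit", "location": {"type": "point"}}
gesture_fs = {"location": {"type": "point", "coords": (31.2, 42.7)}}
print(unify(speech_fs, gesture_fs))
# {'command': 'create_unit', 'location': {'type': 'point', 'coords': (31.2, 42.7)}}

# A gesture of a different type fails to unify, as in the example above.
area_fs = {"location": {"type": "area", "coords": [(0, 0), (1, 1)]}}
print(unify(speech_fs, area_fs))  # -> None
```

The appeal of this scheme is that partiality is handled for free: each modality contributes only the features it can determine, and unification both checks their compatibility and assembles the combined interpretation.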