10.4.1 Decision Fusion Model
Meyer et al. [309] have suggested that there are two techniques for audio-visual fusion in developing audio-visual information recognition systems: feature fusion and decision fusion. These approaches are often referred to as early fusion and late fusion, respectively. The first approach is a simple concatenation of the audio and visual features, giving rise to a single data representation before the pattern-matching stage. The second approach applies separate pattern-matching algorithms to the audio and video data and then merges the estimated likelihoods of the single-modality matching decisions.
In the current application, a late fusion scheme is chosen for two reasons. First, the visual features in this work have a physical structure different from that of the audio features, both in dimensionality and in weighting scheme. Second, previous studies of human perception [308] suggest that audio and visual processing are carried out independently in the two modalities and combined only at a very late stage. Audio contains information that is often not available in the visual signal [333]; it may therefore not be appropriate to concatenate audio and visual features into a single representation.
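As a rough illustration of the difference, the following sketch (Python with NumPy; the feature dimensionalities and the single-modality matchers are illustrative placeholders, not the actual extractors used in this chapter) contrasts the two fusion styles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature vectors for one sample (dimensionalities are illustrative only).
f_a = rng.random(12)   # audio features
f_v = rng.random(50)   # visual features

# Early (feature) fusion: concatenate both modalities into a single
# representation before the pattern-matching stage.
early_repr = np.concatenate([f_a, f_v])

# Late (decision) fusion: run a separate matcher per modality and merge
# only the resulting scores. These matchers are trivial stand-ins; the
# chapter's actual matchers are the similarity functions d_a and d_v.
def audio_matcher(f):
    return float(np.clip(f.mean(), 0.0, 1.0))   # placeholder score in [0, 1]

def visual_matcher(f):
    return float(np.clip(f.mean(), 0.0, 1.0))   # placeholder score in [0, 1]

score_vector = [audio_matcher(f_a), visual_matcher(f_v)]   # passed to the fusion stage
```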
Figure 10.5 shows the architecture of the system, which includes the fusion module and the SVM. The extracted audio and visual data are processed by different similarity functions, d_a and d_v. The function d_a is applied to the audio features, whereas the function d_v is applied to the visual features. Given the extracted data, each function delivers a similarity score between an input sample and a model vector. These scores range between zero (accept) and one (reject). In other words, when the two modules are combined, the fusion algorithm processes a two-dimensional vector in which each component is a score in [0, 1] delivered by the corresponding modality expert. The SVM combines the opinions of the different experts and gives a binary decision.
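As a sketch of this fusion stage (assuming scikit-learn's SVC; the score pairs and labels below are invented purely for illustration), an SVM can be trained on the two-dimensional vectors of expert scores and then used to accept or reject a new sample:

```python
import numpy as np
from sklearn.svm import SVC

# Each row is one sample's score vector [d_a, d_v], both in [0, 1],
# where 0 leans towards "accept" and 1 towards "reject" (invented values).
scores = np.array([
    [0.12, 0.08],
    [0.20, 0.15],
    [0.85, 0.91],
    [0.78, 0.66],
])
labels = np.array([1, 1, 0, 0])   # 1 = accept, 0 = reject (illustrative labels)

# The SVM acts as the fusion module: it learns a decision boundary in the
# two-dimensional score space spanned by the two modality experts.
fusion_svm = SVC(kernel="rbf")
fusion_svm.fit(scores, labels)

# A new sample is accepted or rejected from its pair of expert scores alone.
new_sample = np.array([[0.25, 0.30]])
decision = fusion_svm.predict(new_sample)   # binary decision, e.g. array([1])
```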
Let f_a and f_v denote the feature vectors extracted from the audio and video signals, where the subscripts a and v stand for audio and visual, respectively. Given the video database, we can obtain a set of samples:
x_i = [ d_{a,i}(f_{a,i}, f_a),  d_{v,i}(f_{v,i}, f_v) ],    i = 1, 2, ..., N_T        (10.17)
where

d_{a,i} = || f_{a,i} - f_a ||_1        (10.18)

d_{v,i} = (f_{v,i} · f_v) / ( || f_{v,i} || × || f_v || )        (10.19)
The function d_i measures the similarity between the i-th sample f_i and f, the feature vector of the representative sample from the positive class; d_{a,i} and d_{v,i} are computed from the audio and visual domains, respectively. From a given video