10.4.1 Decision Fusion Model
Meyer et al. [309] have suggested that there are two techniques for audio-visual fusion in developing audio-visual information recognition systems: feature fusion and decision fusion. These approaches are often referred to as early fusion and late fusion, respectively. The first approach is a simple concatenation of the audio and visual features, giving rise to a single data representation before the pattern-matching stage. The second approach applies separate pattern-matching algorithms to the audio and video data and then merges the estimated likelihoods of the single-modality matching decisions.
In the current application, a late fusion scheme is chosen for two reasons. First, the visual features in this work have a physical structure different from that of the audio features, both in dimensionality and in weighting scheme. Second, previous studies of human perception [308] suggest that audio and visual processing are carried out independently in the two modalities and combined only at a very late stage. Audio contains information that is often not available in the visual signal [333]; it may therefore not be appropriate to concatenate audio and visual features into a single representation.
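As a rough illustration of the difference, the following sketch (Python with NumPy; the feature dimensionalities and the single-modality matchers are illustrative placeholders, not the actual extractors used in this chapter) contrasts the two fusion styles:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature vectors for one sample (dimensionalities are illustrative only).
f_a = rng.random(12)   # audio features
f_v = rng.random(50)   # visual features

# Early (feature) fusion: concatenate both modalities into a single
# representation before the pattern-matching stage.
early_repr = np.concatenate([f_a, f_v])

# Late (decision) fusion: run a separate matcher per modality and merge
# only the resulting scores. These matchers are trivial stand-ins; the
# chapter's actual matchers are the similarity functions d_a and d_v.
def audio_matcher(f):
    return float(np.clip(f.mean(), 0.0, 1.0))   # placeholder score in [0, 1]

def visual_matcher(f):
    return float(np.clip(f.mean(), 0.0, 1.0))   # placeholder score in [0, 1]

score_vector = [audio_matcher(f_a), visual_matcher(f_v)]   # passed to the fusion stage
```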
Figure 10.5 shows the architecture of the system, which includes the fusion module and the SVM. The extracted audio and visual data are processed by different similarity functions, d_a and d_v. The function d_a is applied to the audio features, whereas the function d_v is applied to the visual features. Given the extracted data, each function delivers a similarity score between an input sample and a model vector. These scores range between zero (accept) and one (reject). In other words, when the two modules are combined, the fusion algorithm processes a two-dimensional vector in which each component is a score in [0, 1] delivered by the corresponding modality expert. The SVM combines the opinions of the different experts and gives a binary decision.
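As a sketch of this fusion stage (assuming scikit-learn's SVC; the score pairs and labels below are invented purely for illustration), an SVM can be trained on the two-dimensional vectors of expert scores and then used to accept or reject a new sample:

```python
import numpy as np
from sklearn.svm import SVC

# Each row is one sample's score vector [d_a, d_v], both in [0, 1],
# where 0 leans towards "accept" and 1 towards "reject" (invented values).
scores = np.array([
    [0.12, 0.08],
    [0.20, 0.15],
    [0.85, 0.91],
    [0.78, 0.66],
])
labels = np.array([1, 1, 0, 0])   # 1 = accept, 0 = reject (illustrative labels)

# The SVM acts as the fusion module: it learns a decision boundary in the
# two-dimensional score space spanned by the two modality experts.
fusion_svm = SVC(kernel="rbf")
fusion_svm.fit(scores, labels)

# A new sample is accepted or rejected from its pair of expert scores alone.
new_sample = np.array([[0.25, 0.30]])
decision = fusion_svm.predict(new_sample)   # binary decision, e.g. array([1])
```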
Let f_a and f_v denote the feature vectors extracted from the audio and video signals, where the subscripts a and v stand for audio and visual, respectively. Given the video database, we can obtain a set of samples:
x_i = [ d_{a,i}(f_{a,i}, f_a),  d_{v,i}(f_{v,i}, f_v) ],    i = 1, 2, ..., N_T        (10.17)
where

d_{a,i} = || f_{a,i} - f_a ||_1        (10.18)

d_{v,i} = (f_{v,i} · f_v) / ( || f_{v,i} || × || f_v || )        (10.19)
The function d_i measures the similarity between the i-th sample f_i and f, the feature vector of the representative sample from the positive class; d_{a,i} and d_{v,i} are computed from the audio and visual domains, respectively. From a given video