varying environmental conditions. The second problem is to combine the
decisions of the individual classifiers so as to reach a final decision. With the
assumption that each selected feature set is individually discriminative
enough under ideal conditions, the main motivation in a multimodal fusion
scheme is to compensate for possible misclassifications of a certain modality
with other available modalities and to end up with a more reliable system.
These misclassifications are in general inevitable due to environmental noise,
measurement errors or time-varying characteristics of the signals. A critical
issue in multimodal fusion is that it should not deteriorate the performance of the
unimodal classifiers. Thus our ultimate goal should be, at the very least, not to fail whenever one
of the individual classifiers gives the correct decision. In this work, rather
than selecting the best feature set, the emphasis is on this second problem,
i.e. how to combine the decisions of different classifiers in view of the above
discussion. We claim that the crucial point here is first to assess the
reliability of each classifier, or modality, and then favor the classifiers
according to their reliabilities in an appropriate decision fusion scheme.
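As a concrete illustration, the following minimal sketch shows one way such a reliability-weighted decision fusion could be realized. The modality names, reliability weights, and the min-max score normalization are hypothetical placeholders for whatever confidence measures a particular system derives; this is a sketch of the general idea, not the specific fusion rule used in this work.

```python
import numpy as np

def fuse_decisions(class_scores, reliabilities):
    """Combine per-modality class scores with a reliability-weighted sum.

    class_scores : dict mapping modality name -> 1-D array of scores,
                   one score per enrolled speaker (higher = more likely).
    reliabilities: dict mapping modality name -> non-negative weight
                   reflecting how much that modality is trusted.
    Returns the index of the winning speaker.
    """
    modalities = list(class_scores)
    # Normalize each modality's scores to [0, 1] so they are comparable.
    normed = {
        m: (class_scores[m] - class_scores[m].min())
           / (np.ptp(class_scores[m]) + 1e-12)
        for m in modalities
    }
    # Normalize the reliabilities to sum to one, then take the weighted sum.
    w = np.array([reliabilities[m] for m in modalities], dtype=float)
    w = w / (w.sum() + 1e-12)
    fused = sum(wi * normed[m] for wi, m in zip(w, modalities))
    return int(np.argmax(fused))

# Example: three modalities scoring five candidate speakers
# (random scores and hand-picked reliabilities, purely illustrative).
rng = np.random.default_rng(0)
scores = {"audio": rng.random(5), "face": rng.random(5), "lip": rng.random(5)}
weights = {"audio": 0.6, "face": 0.3, "lip": 0.1}
print(fuse_decisions(scores, weights))
```

A less reliable modality (e.g. speech under heavy acoustic noise) then contributes proportionally less to the fused decision, so it cannot single-handedly overturn a correct decision made by a more reliable classifier.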
Existing multimodal speaker identification systems are mostly bimodal,
integrating audio and face information as in [8, 9, 10, 18], audio and lip
information as in [11, 12, 13, 16, 19] or face and lip shape as in [14]. In [10],
Sanderson et al. present an audio-visual person verification system that
integrates voice and face modalities and compares concatenative data-fusion
with adaptive and non-adaptive decision fusion techniques, where adaptation
takes into account the acoustic noise level of the speech signal. Later in [8],
enhanced PCA for face representation and fusion using SVMs and
confidence measures are presented. Another audio-visual person
identification system proposed in [9] uses a Maximum Likelihood Linear
Transformation (MLLT) based data-fusion technique. These related works
do not address lip-motion as a biometric modality for person identification
and they all emphasize the performance of data fusion and decision fusion
separately. In an eigenface-based person identification system, Kittler et al. use
the lip shape to classify face images in order to enhance the face recognition
performance [14].
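To make the distinction between the two fusion strategies compared in these works explicit, the sketch below contrasts concatenative (feature-level) data fusion with decision-level fusion on toy data. The nearest-template classifiers, feature dimensions, and equal weights are invented for illustration and are not taken from the cited systems.

```python
import numpy as np

rng = np.random.default_rng(1)
N_SPEAKERS, D_AUDIO, D_FACE = 5, 8, 6

# Toy per-speaker templates standing in for trained models (hypothetical).
audio_templates = rng.random((N_SPEAKERS, D_AUDIO))
face_templates = rng.random((N_SPEAKERS, D_FACE))
joint_templates = np.hstack([audio_templates, face_templates])

def scores(templates, feat):
    """Similarity of a feature vector to each speaker template
    (negative Euclidean distance, so higher means more similar)."""
    return -np.linalg.norm(templates - feat, axis=1)

def data_fusion(audio_feat, face_feat):
    """Concatenative (feature-level) fusion: stack both feature vectors
    and classify the joint vector with a single model."""
    joint = np.concatenate([audio_feat, face_feat])
    return int(np.argmax(scores(joint_templates, joint)))

def decision_fusion(audio_feat, face_feat, w_audio=0.5, w_face=0.5):
    """Decision-level fusion: score each modality separately, then
    combine the score vectors before picking the speaker."""
    fused = (w_audio * scores(audio_templates, audio_feat)
             + w_face * scores(face_templates, face_feat))
    return int(np.argmax(fused))

# Probe vectors drawn near speaker 2's templates in both modalities.
audio_probe = audio_templates[2] + 0.05 * rng.standard_normal(D_AUDIO)
face_probe = face_templates[2] + 0.05 * rng.standard_normal(D_FACE)
print(data_fusion(audio_probe, face_probe), decision_fusion(audio_probe, face_probe))
```

Feature-level fusion commits to a single classifier over the joint representation, whereas decision-level fusion keeps the modalities separate until the end, which is what makes reliability-based weighting of each modality possible.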
The only work in the literature that addresses a multimodal speaker
identification system using speech, face and lip motion is the one presented
in [19]. In [19], the voice, lip-motion and face modalities are assumed to be
independent of each other, and thus the
multimodal classification is achieved by a decision fusion mechanism. The
face-only module involves a great deal of image analysis to normalize and to