Chapter 16
JOINT AUDIO-VIDEO PROCESSING FOR ROBUST BIOMETRIC SPEAKER IDENTIFICATION IN CAR¹
Engin Erzin, Yücel Yemez, A. Murat Tekalp
Multimedia, Vision and Graphics Laboratory, College of Engineering, Koç University, Istanbul, 34450, Turkey
Email: eerzin@ku.edu.tr
Abstract:
In this chapter, we present our recent results on a multilevel Bayesian decision fusion scheme for the multimodal audio-visual speaker identification problem. The objective is to improve recognition performance over conventional decision fusion schemes. The proposed system decomposes the information in a video stream into three components: speech, lip trace, and face texture. Lip trace features are extracted from the 2D-DCT of successive active lip frames. The mel-frequency cepstral coefficients (MFCC) of the corresponding speech signal are extracted in parallel with the lip features. The resulting two parallel and synchronous feature streams are used to train and test a two-stream Hidden Markov Model (HMM) based identification system. Face texture images are treated separately in the eigenface domain and integrated into the system through decision fusion. Reliability-based ordering in multilevel decision fusion is observed to provide significant robustness at all SNR levels.
Keywords:
Speaker identification, multi-modal, multilevel decision fusion, robustness, in-vehicle
¹ This work has been supported by the Scientific and Technical Research Council of Turkey (TUBITAK) under project EEEAG-101E038.
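To make the front end concrete, the following is a minimal sketch of how the two synchronous feature streams described in the abstract could be produced: MFCC vectors from the speech signal and 2D-DCT coefficients from the active lip frames, computed at a matching frame rate. The library choices (librosa, scipy), the 25 fps video rate, the 32x32 lip-frame size, and the number of retained coefficients are illustrative assumptions, not the chapter's actual configuration.

```python
import numpy as np
from scipy.fft import dctn   # 2D-DCT for lip-frame features (assumed toolkit)
import librosa               # MFCC extraction (assumed toolkit)

def lip_dct_features(lip_frame, keep=8):
    """2D-DCT of a grayscale lip frame; keep the top-left keep x keep
    low-frequency coefficients as the feature vector (a simple block crop
    is used here for brevity instead of zig-zag scanning)."""
    coeffs = dctn(lip_frame, norm='ortho')
    return coeffs[:keep, :keep].flatten()

def mfcc_features(speech, sr=16000, n_mfcc=13, hop_s=0.04):
    """MFCC vectors computed with a hop equal to the video frame period
    (40 ms ~ 25 fps) so the audio and lip streams stay frame-synchronous."""
    hop = int(sr * hop_s)
    return librosa.feature.mfcc(y=speech, sr=sr, n_mfcc=n_mfcc,
                                hop_length=hop).T  # shape: (frames, n_mfcc)

# Toy data: 1 s of audio and 25 synthetic 32x32 lip frames (hypothetical sizes).
audio = np.random.randn(16000).astype(np.float32)
lip_frames = np.random.rand(25, 32, 32)

audio_stream = mfcc_features(audio)                                # (~25, 13)
lip_stream = np.stack([lip_dct_features(f) for f in lip_frames])   # (25, 64)

# These two synchronous streams would then feed a two-stream HMM;
# the shapes here are illustrative, not the chapter's exact configuration.
print(audio_stream.shape, lip_stream.shape)
```

Tying the audio hop length to the video frame rate is what keeps the two streams frame-synchronous, which is the prerequisite for training the two-stream HMM described above.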