Digital Signal Processing Reference
2.2 WTAll Decision Fusion
The conventional max rule given by Equation (7) can be modified to better
handle possible false identity claims. In this slightly modified scheme,
which we will refer to as winner modality takes all (WTAll), the likelihood
ratios in (7) are replaced by the confidence measures defined in (8).
In this way, a strong decision for rejection can also be taken into account
and favored, even when the corresponding likelihood ratio is not the
maximum over all available modalities.
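The fusion rule described above can be sketched in a few lines. This is a minimal illustration, not the chapter's exact formulation: it assumes each modality classifier emits a signed confidence measure (positive supporting acceptance of the claimed identity, negative supporting rejection), standing in for the confidence measures of Equation (8).

```python
# Sketch of winner-modality-takes-all (WTAll) fusion.
# Assumption: each modality returns a signed confidence measure,
# positive = accept the identity claim, negative = reject it.
# (The precise definition corresponds to Equation (8) in the text.)

def wtall_fuse(confidences):
    """Adopt the decision of the modality whose confidence has the
    largest magnitude, whether it argues for acceptance or rejection."""
    winner = max(confidences, key=abs)
    return winner > 0, winner  # (accept?, winning confidence)

# Example: speech and lip trace weakly accept, but the face modality
# strongly rejects, so the fused decision is rejection.
accept, conf = wtall_fuse([0.4, -2.1, 0.3])
```

Unlike a plain max over likelihood ratios, the winning modality here may be the one arguing for rejection, which is exactly the behavior motivated in the text.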
3. FEATURE EXTRACTION
In this section we consider a text-dependent multimodal speaker
identification task. The bimodal database consists of audio and video
signals belonging to individuals of a certain population. Each person in
the database utters a predefined secret phrase, which may vary from one
person to another. The objective is, given the data of an unknown person,
to determine whether this person matches someone in the database. The
person is identified if there is a match and rejected otherwise. The
multimodal system uses three feature sets extracted from each audio-visual
stream, corresponding to three modalities: face, lip trace, and speech.
Our goal is, at a minimum, not to fail whenever one of the individual
classifiers reaches the correct decision, and also to be robust against
false identity claims. The overall classification is based on the
theoretical framework presented in Section 2.
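The identification-with-rejection loop just described can be sketched as follows. All names here are hypothetical placeholders: `score` stands in for the per-modality matching scores, and the simple sum stands in for the fusion framework of Section 2.

```python
# Hypothetical sketch of closed-set identification with rejection.
# `score(modality, stream, templates)` is a placeholder for the
# per-modality matchers; summing the three scores stands in for
# the decision-fusion framework of Section 2.

def identify(av_stream, database, score, threshold=0.0):
    """Return the best-matching person id, or None to reject."""
    best_id, best_score = None, float("-inf")
    for person_id, templates in database.items():
        # Fuse the face, lip-trace, and speech scores for this candidate.
        fused = sum(score(m, av_stream, templates)
                    for m in ("face", "lip", "speech"))
        if fused > best_score:
            best_id, best_score = person_id, fused
    # Reject (false identity claim) if even the best match is weak.
    return best_id if best_score > threshold else None
```

The rejection threshold makes the system robust against claimants who are not in the database at all, rather than forcing a match to the nearest enrolled person.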
3.1 Face Modality
The eigenface technique [4], or more generally principal component
analysis, has proven to be an effective and powerful tool for the
recognition of still faces. The core idea is to reduce the dimensionality
of the problem by obtaining a smaller set of features than the original
set of pixel intensities. In
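The dimensionality-reduction step at the heart of the eigenface technique can be sketched with NumPy. This is a generic PCA-via-SVD illustration under the usual eigenface formulation, not the chapter's specific implementation; the function names are made up for the sketch.

```python
# Minimal eigenface-style sketch: project face images onto the top
# principal components of the training set. Function names are
# illustrative, not from the chapter.
import numpy as np

def eigenfaces(images, k):
    """images: (n_samples, n_pixels) array of flattened face images.
    Returns (mean, basis), where basis holds the top-k principal
    components (the "eigenfaces") as rows."""
    mean = images.mean(axis=0)
    centered = images - mean
    # SVD of the centered data yields the principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def project(image, mean, basis):
    """Reduce one face image to its k eigenface coefficients."""
    return basis @ (image - mean)
```

A face is then represented by its k projection coefficients instead of thousands of raw pixel intensities, and recognition proceeds by comparing coefficient vectors.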