scale [5]. The resulting MFCC features are derived by applying the discrete cosine transform to the log-scaled filter-bank energies,

$$c_{k,n} = \sum_{m=1}^{M_f} \log(E_{k,m}) \cos\!\left[\frac{\pi n}{M_f}\left(m - \frac{1}{2}\right)\right], \qquad n = 1, \ldots, N,$$

where $M_f$ is the number of mel-scaled filter banks and $N$ is the number of MFCC features that are extracted. The MFCC feature vector for the $k$-th frame is defined as

$$\mathbf{c}_k = [c_{k,1}, c_{k,2}, \ldots, c_{k,N}]^T.$$
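As a concrete illustration (not from the original text), the following Python sketch performs this step, assuming the log-scaled mel filter-bank energies are already available; the function and argument names are illustrative.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_log_energies(log_mel_energies: np.ndarray, n_mfcc: int) -> np.ndarray:
    """DCT over log-scaled mel filter-bank energies.

    log_mel_energies: shape (num_frames, num_filter_banks), already
    log-scaled. Returns MFCCs of shape (num_frames, n_mfcc).
    """
    # Orthonormal type-II DCT along the filter-bank axis.
    cepstra = dct(log_mel_energies, type=2, axis=1, norm="ortho")
    # Keep c_1 .. c_N; c_0 (the overall log energy) is dropped, and the
    # number of filter banks must exceed n_mfcc.
    return cepstra[:, 1:n_mfcc + 1]
```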
The audio feature vector for the $k$-th frame is formed as a collection of the MFCC vector along with the first and second delta MFCCs,

$$\mathbf{f}_k^a = [\mathbf{c}_k^T \;\; \Delta\mathbf{c}_k^T \;\; \Delta^2\mathbf{c}_k^T]^T.$$
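The delta streams are commonly computed with a regression over a small window of neighboring frames; the sketch below assumes such a formulation (the window half-width `K = 2` is an assumed default, not stated in the text) and stacks the three streams into the audio feature vector.

```python
import numpy as np

def deltas(features: np.ndarray, K: int = 2) -> np.ndarray:
    """Regression-based deltas over a +/-K frame window."""
    # Repeat edge frames so every frame has K neighbors on each side.
    padded = np.pad(features, ((K, K), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, K + 1))
    out = np.zeros_like(features, dtype=float)
    for t in range(features.shape[0]):
        for n in range(1, K + 1):
            out[t] += n * (padded[t + K + n] - padded[t + K - n])
    return out / denom

def audio_feature_vector(mfcc: np.ndarray) -> np.ndarray:
    """Stack MFCCs with first and second deltas per frame."""
    d = deltas(mfcc)
    dd = deltas(d)
    return np.hstack([mfcc, d, dd])
```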
The gray-scale intensity-based lip stream is transformed into the 2D-DCT domain, and each lip frame is then represented by the first $M$ DCT coefficients of the zig-zag scan, excluding the 0-th (DC) coefficient. The lip feature vector for the $i$-th lip frame is denoted by $\mathbf{f}_i^l$. As the audio features are extracted at a rate of 100 fps and the lip features are extracted at a rate of 15 fps, rate synchronization should be performed prior to the data fusion.
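A sketch of the lip feature extraction under these definitions follows; the separable 2D-DCT and the particular zig-zag traversal direction are standard conventions assumed here, not details given in the text.

```python
import numpy as np
from scipy.fftpack import dct

def zigzag_indices(h: int, w: int):
    """(row, col) pairs in zig-zag order over the anti-diagonals."""
    return sorted(((r, c) for r in range(h) for c in range(w)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def lip_features(frame: np.ndarray, M: int) -> np.ndarray:
    """First M zig-zag 2D-DCT coefficients of a gray-scale lip frame,
    excluding the 0-th (DC) coefficient."""
    # Separable 2D-DCT: DCT along columns, then along rows.
    coeffs = dct(dct(frame.astype(float), type=2, axis=0, norm="ortho"),
                 type=2, axis=1, norm="ortho")
    order = zigzag_indices(*frame.shape)
    # order[0] is the DC term at (0, 0); keep the next M coefficients.
    return np.array([coeffs[r, c] for r, c in order[1:M + 1]])
```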
The lip features are computed using linear interpolation over the $\mathbf{f}_i^l$ sequence to match the 100 fps rate as follows:

$$\hat{\mathbf{f}}_k^l = (1 - \alpha_k)\,\mathbf{f}_i^l + \alpha_k\,\mathbf{f}_{i+1}^l,$$

where $i = \lfloor 15k/100 \rfloor$ and $\alpha_k = 15k/100 - i$.
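The synchronization step then amounts to resampling the 15 fps lip sequence at the 100 fps audio frame times. A minimal sketch, assuming the rates stated above and uniform frame timing:

```python
import numpy as np

def upsample_lip_features(lip_feats: np.ndarray,
                          lip_fps: float = 15.0,
                          audio_fps: float = 100.0) -> np.ndarray:
    """Linearly interpolate lip features to the audio frame rate.

    lip_feats: shape (num_lip_frames, M); returns (num_audio_frames, M).
    """
    t_lip = np.arange(lip_feats.shape[0]) / lip_fps
    # Audio frame times up to the last available lip frame.
    t_audio = np.arange(0.0, t_lip[-1] + 1e-9, 1.0 / audio_fps)
    # Interpolate each feature dimension independently.
    return np.stack([np.interp(t_audio, t_lip, lip_feats[:, m])
                     for m in range(lip_feats.shape[1])], axis=1)
```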
The unimodal and fused temporal characterizations of the audio and the lip modalities are performed using Hidden Markov Models (HMMs), which are reliable structures for modeling the human hearing system and are thus widely used for speech recognition and speaker identification problems [2]. In this work, a word-level continuous-density HMM structure is built for the speaker identification task. Each speaker in the database population is modeled with a separate HMM and is represented by the feature sequence extracted over the audio/lip stream while uttering the secret phrase. First, a world HMM is trained over the whole training data of the population. Then each HMM associated with a speaker is trained over several repetitions of the audio-video utterance of the corresponding speaker. In the identification process, given a test feature set, each HMM structure produces a likelihood, and the speaker whose model yields the highest likelihood is selected.
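The text does not name an implementation; as one possible realization, the sketch below uses the third-party `hmmlearn` package with Gaussian emissions (a continuous-density HMM), trains one model per speaker, and selects the maximum-likelihood model at test time. The number of states and the omission of the world-model adaptation step are simplifying assumptions.

```python
import numpy as np
from hmmlearn import hmm

def train_speaker_models(train_data: dict[str, list[np.ndarray]],
                         n_states: int = 5) -> dict[str, hmm.GaussianHMM]:
    """Fit one continuous-density (Gaussian) HMM per speaker.

    train_data maps a speaker id to a list of feature sequences
    (each of shape (num_frames, feature_dim)) from repeated
    utterances of the secret phrase.
    """
    models = {}
    for speaker, sequences in train_data.items():
        X = np.vstack(sequences)
        lengths = [len(s) for s in sequences]
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models[speaker] = model
    return models

def identify(models: dict[str, hmm.GaussianHMM],
             test_seq: np.ndarray) -> str:
    """Return the speaker whose HMM gives the highest log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(test_seq))
```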