Digital Signal Processing Reference
In-Depth Information
from the population. The audio-visual data, MVGL-AVD, were acquired
using a Sony DSR-PD150P video camera at the Multimedia Vision and Graphics
Laboratory of Koç University. A collection of sample images from the
audio-visual database is presented in Figure 16-2.
The equal error rate (EER), the operating point at which the false accept
rate (FAR) equals the false reject rate (FRR), is used in the performance
analysis of the speaker identification system. False accepts occur when an
imposter is identified as an accepted client or when a client from the accept
database is identified incorrectly. False rejects occur when a client from the accept database is
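rejected. The false accept and false reject rates are computed as

$$\mathrm{FAR} = 100 \times \frac{F_a}{N_a + N_r}, \qquad \mathrm{FRR} = 100 \times \frac{F_r}{N_a},$$

where $F_a$ and $F_r$ are the numbers of false accepts and false rejects, and
$N_a$ and $N_r$ are the total number of trials in the accept and reject
scenarios, respectively.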
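
The counting rules above can be illustrated with a minimal sketch. It assumes that each trial yields a single confidence score compared against a threshold, that FAR is normalized by all trials, and that FRR is normalized by the accept-scenario trials only. The function names, the score convention, and the threshold sweep are illustrative assumptions, not the authors' procedure.

```python
def error_rates(client_scores, client_correct, impostor_scores, threshold):
    """Return (FAR, FRR) in percent at one decision threshold.

    client_scores   : scores of trials in the accept scenario
    client_correct  : True where the client was identified as the right person
    impostor_scores : scores of trials in the reject scenario
    """
    n_a = len(client_scores)    # trials in the accept scenario
    n_r = len(impostor_scores)  # trials in the reject scenario

    false_accepts = 0
    false_rejects = 0
    for score, correct in zip(client_scores, client_correct):
        if score < threshold:
            false_rejects += 1  # genuine client rejected
        elif not correct:
            false_accepts += 1  # client accepted under the wrong identity
    for score in impostor_scores:
        if score >= threshold:
            false_accepts += 1  # imposter accepted

    # Assumed normalization: FAR over all trials, FRR over client trials.
    far = 100.0 * false_accepts / (n_a + n_r)
    frr = 100.0 * false_rejects / n_a
    return far, frr


def equal_error_rate(client_scores, client_correct, impostor_scores):
    """Sweep the observed scores as thresholds and return (threshold, EER),
    where EER is the mean of FAR and FRR at the point where they are closest."""
    candidates = sorted(set(client_scores) | set(impostor_scores))
    best = None
    for t in candidates:
        far, frr = error_rates(client_scores, client_correct, impostor_scores, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, (far + frr) / 2.0)
    return best[1], best[2]
```

With only a finite set of trials, FAR and FRR rarely coincide exactly, so the sketch reports the average of the two at the threshold where they are closest.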
The temporal characterization of audio, lip and audio-lip fused streams
has been obtained using a 6-state left-to-right two-mixture continuous
density HMM structure for each speaker. The acquired video data is first
split into segments of secret phrase utterances. The visual and audio streams
are then separated into two parallel streams, where the visual stream has
gray-level video frames of size 720×576 pixels containing the frontal view of
a speaker's head at a rate of 15 fps and the audio stream has a 16 kHz sampling
rate. The acoustic noise, which is added to the speech signal to observe the
identification performance under adverse conditions, is picked to be either
car noise or a mixture of office and babble noise.
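
The text does not state how the noise is scaled before it is added. One common approach, sketched here as an assumption, is to scale the noise so that the ratio of average speech power to average noise power over the utterance matches a target SNR in dB. The function name `mix_at_snr` and the whole-utterance power estimate are hypothetical choices for illustration.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to a target SNR (in dB).

    Both inputs are sequences of samples at the same sampling rate.
    Power is averaged over the whole utterance (an assumption).
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]

    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise is silent")

    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Sweeping `snr_db` downward with the same speech and noise recordings gives the series of conditions needed to tabulate performance against SNR.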
The audio stream is processed over 10 ms frames, each centered on a 25
ms Hamming window. The MFCC feature vector is formed from 13
cepstral coefficients, excluding the gain coefficient, using 26
mel-frequency bins. The resulting audio feature vector, of size 39, includes
the MFCC vector and the first and the second delta MFCC vectors. Two
variations of the audio feature vector are defined based on the frequency
selectiveness in MFCC calculation. The first mel-band, which yields the
first energy term (see Eq. 12), is set to start at either 50 Hz or 250 Hz;
the resulting features are called MFCC and high-pass MFCC (MFCC+HP),
respectively. In the audio-only scenario, the identification performance
degrades rapidly with decreasing SNR, as seen in Table 16-1. The high-pass
MFCC features are observed to be more robust under environmental