10.2.4 Feature Extraction from Audio Signal
To extract features, the audio signal is first separated from the input video clip and re-sampled to a uniform sampling rate. It is then decomposed using a one-dimensional DWT, which splits the signal into two subbands at each wavelet scale: a low-frequency subband and a high-frequency subband. The audio signal differs from a gray-scale or color image signal. In images, the values of adjacent pixels usually do not change sharply, so the low-frequency subband of an image is a low-resolution approximation of the original image. A digital audio signal, by contrast, is an oscillating waveform containing a variety of frequency components that vary with time. Because most audio signals consist of a wide range of frequencies, the wavelet coefficients of an audio signal have many large values at the detail levels, and the low-frequency subband coefficients do not always provide a good approximation of the original signal.
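As a concrete illustration, here is a minimal sketch of this preprocessing step, assuming Python with NumPy, SciPy, and the PyWavelets package; the target sampling rate, wavelet family, and decomposition depth are illustrative choices, not values prescribed by the text:

```python
import numpy as np
import pywt
from scipy.signal import resample_poly

def decompose_audio(signal, orig_rate, target_rate=16000,
                    wavelet="db4", level=4):
    """Re-sample an audio signal to a uniform rate, then apply a 1-D DWT.

    Returns the low-frequency approximation subband and the list of
    high-frequency detail subbands, one per wavelet scale.
    """
    # Polyphase re-sampling to the uniform target rate
    g = np.gcd(int(orig_rate), int(target_rate))
    signal = resample_poly(signal, target_rate // g, orig_rate // g)
    # Multi-level DWT: wavedec returns [cA_L, cD_L, ..., cD_1]
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]
    return approx, details
```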
The wavelet decomposition scheme matches models of the octave division of sound in perceptual scales. The wavelet transform also provides a multiscale representation of sound information, so an indexing structure can be built on this scale property. Moreover, audio signals are nonstationary: their frequency content evolves over time, and the wavelet transform provides both frequency and time information simultaneously. These properties of the wavelet transform for sound-signal decomposition form the foundation of audio indexing.
The distribution of the wavelet coefficients in the high-frequency subbands is modeled by a mixture of two Laplacians centered at 0, and the parameters of this mixture model are used as features for indexing. The model can be represented as:
$$p(w(i)) = \alpha_1\, p_1\!\left(w(i) \mid b_1\right) + \alpha_2\, p_2\!\left(w(i) \mid b_2\right) \qquad (10.10)$$

$$\alpha_1 + \alpha_2 = 1 \qquad (10.11)$$
where $\alpha_1$ and $\alpha_2$ are the mixing probabilities of the two components $p_1$ and $p_2$, respectively, $w(i)$ are the wavelet coefficients, and $b_1$ and $b_2$ are the parameters of the Laplacian distributions $p_1$ and $p_2$, respectively.
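Since a Laplacian centered at 0 has density $p_k(w \mid b_k) = \exp(-|w|/b_k)/(2 b_k)$, the EM updates for this mixture have closed form: the E-step computes each component's posterior responsibility for every coefficient, and the M-step re-estimates $\alpha_k$ as the mean responsibility and $b_k$ as the responsibility-weighted mean of $|w(i)|$. A minimal sketch in Python with NumPy (the function name, initialization, and stopping rule are illustrative assumptions, not the authors' exact procedure):

```python
import numpy as np

def fit_laplacian_mixture(coeffs, n_iter=100, tol=1e-8):
    """Fit a mixture of two zero-mean Laplacians to wavelet detail
    coefficients via EM (cf. Eqs. 10.10-10.11).

    Returns the mixing probabilities (alpha_1, alpha_2) and the
    scale parameters (b_1, b_2).
    """
    a = np.abs(np.asarray(coeffs, dtype=float))
    alpha = np.array([0.5, 0.5])
    # Illustrative initialization: one narrow and one wide component
    b = np.array([0.5, 2.0]) * (a.mean() + 1e-12)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per coefficient
        dens = alpha / (2.0 * b) * np.exp(-a[:, None] / b)   # shape (N, 2)
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        # M-step: closed-form updates for a zero-mean Laplacian mixture
        alpha = resp.mean(axis=0)
        b = (resp * a[:, None]).sum(axis=0) / resp.sum(axis=0)
        b = np.maximum(b, 1e-12)              # guard against collapse
        ll = np.log(total).sum()
        if ll - prev_ll < tol:                # EM log-likelihood is monotone
            break
        prev_ll = ll
    return alpha, b
```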
Table 10.1 summarises the feature extraction algorithm that employs the EM
algorithm to obtain the model parameters. In practice, the wavelet decomposition of
the audio signals is taken up to L levels. The feature vector used for indexing the
video clips consists of the following components:
$$f_a = \left\{ \{m_0, \sigma_0\},\; \{\alpha_{1,l},\, b_{1,l},\, b_{2,l}\},\; l = 1, 2, \ldots, L \right\} \qquad (10.12)$$
where $f_a$ denotes the feature vector describing the audio content. It comprises the mean $m_0$ and standard deviation $\sigma_0$ of the wavelet coefficients in the low-frequency subband, together with the mixture-model parameters computed for each of the high-frequency subbands.
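Putting the pieces together, a sketch of assembling $f_a$, reusing the hypothetical decompose_audio and fit_laplacian_mixture helpers from the earlier sketches (and their imports):

```python
def audio_feature_vector(signal, orig_rate, level=4):
    """Assemble the feature vector f_a of Eq. (10.12): the mean and
    standard deviation of the low-frequency subband, plus the mixture
    parameters (alpha_1, b_1, b_2) of each high-frequency subband."""
    approx, details = decompose_audio(signal, orig_rate, level=level)
    features = [approx.mean(), approx.std()]       # m_0, sigma_0
    for d in details:                              # scales l = 1, ..., L
        alpha, b = fit_laplacian_mixture(d)
        # alpha_2 is omitted since alpha_1 + alpha_2 = 1 (Eq. 10.11)
        features.extend([alpha[0], b[0], b[1]])    # alpha_{1,l}, b_{1,l}, b_{2,l}
    return np.array(features)
```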