10.2.4 Feature Extraction from Audio Signal
To extract features, the audio signal is first separated from the input video clip and re-sampled to a uniform sampling rate. It is then decomposed using a one-dimensional DWT, which splits the signal into two subbands at each wavelet scale: a low-frequency subband and a high-frequency subband. The audio signal differs from a gray-scale or color image signal. In images, the values of adjacent pixels usually do not change sharply, so the low-frequency subband of an image is a low-resolution approximation of the original image. A digital audio signal, by contrast, is an oscillating waveform containing a variety of frequency components that vary with time. Because most audio signals consist of a wide range of frequencies, the wavelet coefficients of an audio signal have many large values at the detail levels, and the low-frequency subband coefficients do not always provide a good approximation of the original signal.
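As a concrete illustration, here is a minimal sketch of this preprocessing step, assuming Python with NumPy, SciPy, and the PyWavelets package; the target sampling rate, wavelet family, and decomposition depth are illustrative choices, not values prescribed by the text:

```python
import numpy as np
import pywt
from scipy.signal import resample_poly

def decompose_audio(signal, orig_rate, target_rate=16000,
                    wavelet="db4", level=4):
    """Re-sample an audio signal to a uniform rate, then apply a 1-D DWT.

    Returns the low-frequency approximation subband and the list of
    high-frequency detail subbands, one per wavelet scale.
    """
    # Polyphase re-sampling to the uniform target rate
    g = np.gcd(int(orig_rate), int(target_rate))
    signal = resample_poly(signal, target_rate // g, orig_rate // g)
    # Multi-level DWT: wavedec returns [cA_L, cD_L, ..., cD_1]
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]
    return approx, details
```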
The wavelet decomposition scheme matches models of the octave division of sound in perceptual scales. The wavelet transform also provides a multiscale representation of sound information, so an indexing structure can be built on this scale property. Moreover, audio signals are nonstationary: their frequency content evolves over time, and the wavelet transform provides both frequency and time information simultaneously. These properties of the wavelet transform for sound-signal decomposition form the foundation of audio indexing.
The distribution of the wavelet coefficients in the high-frequency subbands is modeled by a mixture of two Laplacians centered at 0, and the parameters of this mixture model are used as features for indexing. The model can be represented as:
$$p(w(i)) = \alpha_1\, p_1\!\left(w(i) \mid b_1\right) + \alpha_2\, p_2\!\left(w(i) \mid b_2\right) \qquad (10.10)$$

$$\alpha_1 + \alpha_2 = 1 \qquad (10.11)$$
where $\alpha_1$ and $\alpha_2$ are the mixing probabilities of the two components $p_1$ and $p_2$, respectively, $w(i)$ are the wavelet coefficients, and $b_1$ and $b_2$ are the parameters of the Laplacian distributions $p_1$ and $p_2$, respectively.
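Since a Laplacian centered at 0 has density $p_k(w \mid b_k) = \exp(-|w|/b_k)/(2 b_k)$, the EM updates for this mixture have closed form: the E-step computes each component's posterior responsibility for every coefficient, and the M-step re-estimates $\alpha_k$ as the mean responsibility and $b_k$ as the responsibility-weighted mean of $|w(i)|$. A minimal sketch in Python with NumPy (the function name, initialization, and stopping rule are illustrative assumptions, not the authors' exact procedure):

```python
import numpy as np

def fit_laplacian_mixture(coeffs, n_iter=100, tol=1e-8):
    """Fit a mixture of two zero-mean Laplacians to wavelet detail
    coefficients via EM (cf. Eqs. 10.10-10.11).

    Returns the mixing probabilities (alpha_1, alpha_2) and the
    scale parameters (b_1, b_2).
    """
    a = np.abs(np.asarray(coeffs, dtype=float))
    alpha = np.array([0.5, 0.5])
    # Illustrative initialization: one narrow and one wide component
    b = np.array([0.5, 2.0]) * (a.mean() + 1e-12)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component per coefficient
        dens = alpha / (2.0 * b) * np.exp(-a[:, None] / b)   # shape (N, 2)
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        # M-step: closed-form updates for a zero-mean Laplacian mixture
        alpha = resp.mean(axis=0)
        b = (resp * a[:, None]).sum(axis=0) / resp.sum(axis=0)
        b = np.maximum(b, 1e-12)              # guard against collapse
        ll = np.log(total).sum()
        if ll - prev_ll < tol:                # EM log-likelihood is monotone
            break
        prev_ll = ll
    return alpha, b
```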
Table 10.1 summarises the feature extraction algorithm that employs the EM
algorithm to obtain the model parameters. In practice, the wavelet decomposition of
the audio signals is taken up to L levels. The feature vector used for indexing the
video clips consists of the following components:
$$f_a = \left\{ \{m_0, \sigma_0\},\; \{\alpha_{1,l},\, b_{1,l},\, b_{2,l}\},\; l = 1, 2, \ldots, L \right\} \qquad (10.12)$$
where $f_a$ denotes the feature vector describing the audio content. It comprises the mean $m_0$ and standard deviation $\sigma_0$ of the wavelet coefficients in the low-frequency subband, together with the mixture-model parameters computed for each of the high-frequency subbands.
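Putting the pieces together, a sketch of assembling $f_a$, reusing the hypothetical decompose_audio and fit_laplacian_mixture helpers from the earlier sketches (and their imports):

```python
def audio_feature_vector(signal, orig_rate, level=4):
    """Assemble the feature vector f_a of Eq. (10.12): the mean and
    standard deviation of the low-frequency subband, plus the mixture
    parameters (alpha_1, b_1, b_2) of each high-frequency subband."""
    approx, details = decompose_audio(signal, orig_rate, level=level)
    features = [approx.mean(), approx.std()]       # m_0, sigma_0
    for d in details:                              # scales l = 1, ..., L
        alpha, b = fit_laplacian_mixture(d)
        # alpha_2 is omitted since alpha_1 + alpha_2 = 1 (Eq. 10.11)
        features.extend([alpha[0], b[0], b[1]])    # alpha_{1,l}, b_{1,l}, b_{2,l}
    return np.array(features)
```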