10.2.5 Performance of Video Retrieval Using Audio Indexing
The experiments reported in this section were conducted on a database
consisting of 24 full-length, mostly recent mainstream Hollywood movies,
chosen to represent the more popular films, music videos, and commercials. The
database included Titanic, The Patriot, and The Postman, as well as Pakistani music
videos and films. All video files were segmented into 6,000 clips, each of which
contained one to three shots and was approximately 6 s long.
The feature extraction algorithm explained in Table 10.1 was applied to obtain
the audio features. A wavelet transform with nine-level decomposition was applied
to the audio signal from each video clip. The coefficients in each high-frequency
subband were then characterized by the LMM. The resulting model parameters,
together with the mean and standard deviation of the wavelet coefficients in the
low-frequency subband, were used to form feature vectors according to Eq. (10.12).
In addition, because the feature components represent different physical quantities
and have different dynamic ranges, the Gaussian normalization technique [329] was
employed to map each vector component to the range [-1, 1].
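The normalization step can be illustrated with a minimal sketch. It assumes one common form of Gaussian normalization, in which each component is centered on its database mean, divided by three standard deviations, and clipped to [-1, 1]; the exact variant used in [329] is not given in this section, so the 3-sigma scaling and the clipping are assumptions. The function name `gaussian_normalize` is hypothetical.

```python
from statistics import fmean, pstdev


def gaussian_normalize(vectors):
    """Scale each feature component across the database to [-1, 1].

    Assumed form: (x - mean) / (3 * std), then clip to [-1, 1].
    `vectors` is a list of equal-length feature vectors (lists of floats).
    """
    # Transpose so that each tuple holds one component over all clips.
    components = list(zip(*vectors))
    means = [fmean(c) for c in components]
    stds = [pstdev(c) for c in components]

    normalized = []
    for vec in vectors:
        row = []
        for x, mu, sigma in zip(vec, means, stds):
            if sigma == 0.0:
                # A constant component carries no information; map it to 0.
                row.append(0.0)
            else:
                z = (x - mu) / (3.0 * sigma)
                # Values beyond three standard deviations are clipped.
                row.append(max(-1.0, min(1.0, z)))
        normalized.append(row)
    return normalized
```

After this step, components with very different dynamic ranges (for example, a model parameter near 1 and a coefficient magnitude in the hundreds) contribute on a comparable scale to the distance between feature vectors.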
A total of 25 queries were generated from high-level query concepts
that included “Fighting”, “Ship Crashing”, “Music Video”, and “Dance Party”.
These concepts were chosen because audio information is the dominant feature
in each of them. Five queries were performed for each concept, and the retrieval
precision was measured over the 16 best matches. Table 10.2 shows the retrieval
results obtained by using the audio description for video retrieval, at 7-level and
9-level wavelet decompositions. The results were obtained by averaging the
precision within each query concept, as well as over all queries. These results
indicate that audio descriptors are effective in finding video clips that contain
the specified concepts. The retrieval results varied with the characteristics of the
query, and performance was highest for queries containing dialogue. An average
retrieval precision of 84.2 % was achieved with the 9-level decomposition, which
was 6.4 % better than with the 7-level decomposition. Further increasing the number
of decomposition levels may improve performance, but at the expense of greater
computational overhead.
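The evaluation protocol described above (precision over the 16 best matches, averaged per concept and over all queries) can be sketched as follows. This is an illustrative outline only; the function names `precision_at_k` and `average_precisions` and the data layout are hypothetical and are not taken from the original experiments.

```python
from collections import defaultdict


def precision_at_k(ranked_ids, relevant_ids, k=16):
    """Fraction of the top-k retrieved clips that are relevant.

    The denominator is always k, even if fewer than k clips are returned.
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for clip_id in top_k if clip_id in relevant_ids)
    return hits / k


def average_precisions(results):
    """Average precision per concept and over all queries.

    `results` is a list of (concept, precision) pairs, one per query.
    """
    per_concept = defaultdict(list)
    for concept, precision in results:
        per_concept[concept].append(precision)
    concept_avg = {c: sum(v) / len(v) for c, v in per_concept.items()}
    overall = sum(p for _, p in results) / len(results)
    return concept_avg, overall
```

With five queries per concept, each per-concept figure is the mean of five precision values, and the overall figure is the mean over all 25 queries.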
Similarity is hard to define because it is inherently subjective. We can,
however, define a notion of similarity based on the concept of the clip. In the case
of music videos, clips are considered similar if they belong to the same song.
Likewise, for clips that contain dialogue, clips belonging to the same movie are
taken as similar because they involve similar characters. The lowest performance
was obtained for “Ship Crashing”, owing to its audio similarity with the “Sound
Effects” class: the two classes overlap in meaning in that they contain similar
audio content. Performance improved as the number of decomposition levels
increased.