To this end, this chapter presents the application of perception-based features extracted
from different modalities and fused through a machine learning process in order
to retrieve and classify relevant movie clips.
Video content analysis tasks, such as the detection of complex events, are
intrinsically multimodal problems, since both the audio and visual channels provide
important clues. Recognition of video entities such as events in the visual domain
alone is challenging enough since a video contains large variations in lighting,
viewpoint, camera motion, etc. However, video also contains audio information, which provides additional useful clues for content analysis. The video content captured
is multimodal, and the task of video content analysis requires a fusion model to
capture both consistent and inconsistent audio-visual patterns for video indexing
and retrieval.
In Sect. 10.2 , we begin with the method for audio content analysis and indexing.
The modeling scheme based on the Laplacian mixture model (LMM) is presented
and demonstrated for indexing and retrieval of videos using audio content. The
LMM is utilized to capture the peaky distribution of the wavelet coefficients. Its parameters provide a low-dimensional feature vector for video indexing and an efficient audio feature for finding clues to video events.
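For reference, a zero-mean Laplacian mixture density over a wavelet coefficient w can be written in the general form (the number of components K and this exact parameterization are assumptions here, not necessarily the chapter's specific settings):

p(w) = \sum_{k=1}^{K} \alpha_k \, \frac{1}{2 b_k} \exp\!\left(-\frac{|w|}{b_k}\right), \qquad \sum_{k=1}^{K} \alpha_k = 1,

where the mixing weights \alpha_k and scale parameters b_k together form the low-dimensional feature vector: components with small b_k model the sharp peak of the coefficient distribution near zero, while components with larger b_k model its heavy tails.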
Section 10.3 presents the application of the template-frequency modeling (TFM)
method for visual content characterization of movie clips, and experimentally
explores its efficiency and robustness. TFM's performance is also compared to single- and multi-frame-based indexing methods, which rely on frame clustering. A movie search engine is developed that addresses the difficulty of video retrieval through automatic and semi-automatic relevance feedback.
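As a rough illustration of the template-based indexing idea (not the exact TFM formulation presented in Sect. 10.3; the descriptors, codebook size, and cosine ranking below are assumptions for the sketch), per-frame descriptors can be quantized against a codebook of visual templates and each clip summarized by its template-frequency signature:

import numpy as np
from sklearn.cluster import KMeans

def build_template_codebook(frame_descriptors, n_templates=64):
    # frame_descriptors: (n_frames, d) array of per-frame feature vectors from training clips
    return KMeans(n_clusters=n_templates, n_init=10, random_state=0).fit(frame_descriptors)

def clip_signature(codebook, clip_descriptors):
    # Assign each frame of a clip to its nearest template and count template occurrences
    labels = codebook.predict(clip_descriptors)
    hist = np.bincount(labels, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # normalized template-frequency signature

def rank_clips(query_signature, database_signatures):
    # Rank database clips by cosine similarity to the query clip's signature
    db = np.asarray(database_signatures)
    sims = db @ query_signature / (
        np.linalg.norm(db, axis=1) * np.linalg.norm(query_signature) + 1e-12)
    return np.argsort(-sims)

Relevance feedback can then reweight the query signature from clips the user (or the system, in the semi-automatic case) marks as relevant; the actual TFM scheme additionally exploits how templates are distributed over time within a clip.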
While the previous sections explain the extraction methods for perception-
based features, Sect. 10.4 presents a learning algorithm for audio-visual fusion
and demonstrates its application for video classification in a movie database. The
perception-based features are extracted from different modalities and fused through
a machine learning process. In order to capture the spatial-temporal information,
TFM is applied to extract visual features, and LMM is utilized to extract audio
features. These features are fused at a late fusion stage and input to a support
vector machine (SVM) to construct a decision function for the classification of
videos according to a given concept. The experimental results show that the system implementing this fusion method attained high classification accuracy when applied to a large database containing various types of Hollywood movies.
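A minimal sketch of this fusion step, assuming the per-clip LMM audio parameters and TFM visual signatures have already been computed (the concatenation-based fusion, function names, and RBF kernel below are illustrative choices, not the chapter's exact configuration):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse_features(audio_feats, visual_feats):
    # Fuse the separately extracted modalities by concatenating their feature vectors
    return np.hstack([audio_feats, visual_feats])

def train_concept_classifier(audio_feats, visual_feats, labels):
    # audio_feats: (n_clips, d_audio), visual_feats: (n_clips, d_visual)
    # labels: 1 if a clip matches the target concept, 0 otherwise
    X = fuse_features(audio_feats, visual_feats)
    scaler = StandardScaler().fit(X)
    clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(scaler.transform(X), labels)
    return scaler, clf

def classify_clip(scaler, clf, audio_feat, visual_feat):
    # Apply the learned decision function to a new clip
    x = fuse_features(audio_feat.reshape(1, -1), visual_feat.reshape(1, -1))
    return clf.predict(scaler.transform(x))[0]

The chapter's fusion stage may combine the modalities differently; the point of the sketch is simply that each modality is characterized independently and the combined representation drives a single SVM decision function per concept.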
10.2 Audio Content Characterization
Users of video data are often interested in certain action sequences that are easier to identify in the audio domain. Audio is effective in linking visually different but semantically related video clips. In this section, a statistical approach is adopted to analyze the audio data and extract audio features for video indexing.
Wavelet transformation is applied to the audio signal and the LMM is utilized for
characterization of audio content.
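A minimal sketch of this pipeline, assuming a PyWavelets decomposition and a two-component, zero-mean Laplacian mixture fitted by EM (the wavelet family, decomposition level, number of components, and initialization are assumptions, not the chapter's exact settings):

import numpy as np
import pywt

def lmm_audio_features(signal, wavelet='db4', level=3, n_components=2, n_iter=50):
    # Wavelet-decompose the audio frame and keep the magnitudes of the detail coefficients
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    w = np.abs(np.concatenate(coeffs[1:]))
    # Initialize mixing weights and scale parameters of the zero-mean Laplacian components
    alpha = np.full(n_components, 1.0 / n_components)
    b = np.quantile(w, np.linspace(0.3, 0.9, n_components))
    for _ in range(n_iter):
        # E-step: responsibility of each Laplacian component for each coefficient
        dens = alpha / (2.0 * b) * np.exp(-w[:, None] / b)
        resp = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
        # M-step: update weights and scales (weighted mean absolute coefficient value)
        alpha = resp.mean(axis=0)
        b = (resp * w[:, None]).sum(axis=0) / resp.sum(axis=0)
    # The fitted mixture parameters serve as the low-dimensional audio feature vector
    return np.concatenate([alpha, b])

Because only a handful of mixture parameters are retained per frame, the resulting index is compact while still reflecting the peaky, heavy-tailed shape of the wavelet coefficient distribution.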