To this end, this chapter presents the application of perception-based features extracted
from different modalities and fused through a machine learning process in order
to retrieve and classify relevant movie clips.
Video content analysis tasks, such as the detection of complex events, are
intrinsically multimodal problems, since both the audio and visual channels provide
important clues. Recognition of video entities such as events in the visual domain
alone is challenging enough since a video contains large variations in lighting,
viewpoint, camera motion, etc. However, video also contains audio information, which provides additional useful clues for content analysis. The video content captured
is multimodal, and the task of video content analysis requires a fusion model to
capture both consistent and inconsistent audio-visual patterns for video indexing
and retrieval.
In Sect. 10.2 , we begin with the method for audio content analysis and indexing.
The modeling scheme based on the Laplacian mixture model (LMM) is presented
and demonstrated for indexing and retrieval of videos using audio content. The
LMM is utilized to capture the peaky distribution of the wavelet coefficients. Its parameters provide a low-dimensional feature vector for video indexing and an efficient audio feature for finding clues to video events.
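For reference, a zero-mean Laplacian mixture density over a wavelet coefficient w can be written in the general form (the number of components K and this exact parameterization are assumptions here, not necessarily the chapter's specific settings):

p(w) = \sum_{k=1}^{K} \alpha_k \, \frac{1}{2 b_k} \exp\!\left(-\frac{|w|}{b_k}\right), \qquad \sum_{k=1}^{K} \alpha_k = 1,

where the mixing weights \alpha_k and scale parameters b_k together form the low-dimensional feature vector: components with small b_k model the sharp peak of the coefficient distribution near zero, while components with larger b_k model its heavy tails.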
Section 10.3 presents the application of the template-frequency modeling (TFM)
method for visual content characterization of movie clips, and experimentally
explores its efficiency and robustness. TFM's performance is also compared to single- and multi-frame-based indexing methods, which rely on frame clustering. A movie search engine is developed that addresses the difficulty of video retrieval through automatic and semi-automatic relevance feedback.
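As a rough illustration of the template-based indexing idea (not the exact TFM formulation presented in Sect. 10.3; the descriptors, codebook size, and cosine ranking below are assumptions for the sketch), per-frame descriptors can be quantized against a codebook of visual templates and each clip summarized by its template-frequency signature:

import numpy as np
from sklearn.cluster import KMeans

def build_template_codebook(frame_descriptors, n_templates=64):
    # frame_descriptors: (n_frames, d) array of per-frame feature vectors from training clips
    return KMeans(n_clusters=n_templates, n_init=10, random_state=0).fit(frame_descriptors)

def clip_signature(codebook, clip_descriptors):
    # Assign each frame of a clip to its nearest template and count template occurrences
    labels = codebook.predict(clip_descriptors)
    hist = np.bincount(labels, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # normalized template-frequency signature

def rank_clips(query_signature, database_signatures):
    # Rank database clips by cosine similarity to the query clip's signature
    db = np.asarray(database_signatures)
    sims = db @ query_signature / (
        np.linalg.norm(db, axis=1) * np.linalg.norm(query_signature) + 1e-12)
    return np.argsort(-sims)

Relevance feedback can then reweight the query signature from clips the user (or the system, in the semi-automatic case) marks as relevant; the actual TFM scheme additionally exploits how templates are distributed over time within a clip.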
While the previous sections explain the extraction methods for perception-
based features, Sect. 10.4 presents a learning algorithm for audio-visual fusion
and demonstrates its application for video classification in a movie database. The
perception-based features are extracted from different modalities and fused through
a machine learning process. In order to capture the spatial-temporal information,
TFM is applied to extract visual features, and LMM is utilized to extract audio
features. These features are fused at a late fusion stage and input to a support
vector machine (SVM) to construct a decision function for the classification of
videos according to a given concept. The experimental results show that the system implementing this fusion method attained high classification accuracy when applied to a large database containing various types of Hollywood movies.
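A minimal sketch of this fusion step, assuming the per-clip LMM audio parameters and TFM visual signatures have already been computed (the concatenation-based fusion, function names, and RBF kernel below are illustrative choices, not the chapter's exact configuration):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse_features(audio_feats, visual_feats):
    # Fuse the separately extracted modalities by concatenating their feature vectors
    return np.hstack([audio_feats, visual_feats])

def train_concept_classifier(audio_feats, visual_feats, labels):
    # audio_feats: (n_clips, d_audio), visual_feats: (n_clips, d_visual)
    # labels: 1 if a clip matches the target concept, 0 otherwise
    X = fuse_features(audio_feats, visual_feats)
    scaler = StandardScaler().fit(X)
    clf = SVC(kernel='rbf', C=1.0, gamma='scale').fit(scaler.transform(X), labels)
    return scaler, clf

def classify_clip(scaler, clf, audio_feat, visual_feat):
    # Apply the learned decision function to a new clip
    x = fuse_features(audio_feat.reshape(1, -1), visual_feat.reshape(1, -1))
    return clf.predict(scaler.transform(x))[0]

The chapter's fusion stage may combine the modalities differently; the point of the sketch is simply that each modality is characterized independently and the combined representation drives a single SVM decision function per concept.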
10.2 Audio Content Characterization
Users of video data are often interested in certain action sequences that are easier to identify in the audio domain. Audio is effective in linking visually different but semantically related video clips. In this section, a statistical approach is adopted to analyze the audio data and extract audio features for video indexing.
Wavelet transformation is applied to the audio signal and the LMM is utilized for
characterization of audio content.
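A minimal sketch of this pipeline, assuming a PyWavelets decomposition and a two-component, zero-mean Laplacian mixture fitted by EM (the wavelet family, decomposition level, number of components, and initialization are assumptions, not the chapter's exact settings):

import numpy as np
import pywt

def lmm_audio_features(signal, wavelet='db4', level=3, n_components=2, n_iter=50):
    # Wavelet-decompose the audio frame and keep the magnitudes of the detail coefficients
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    w = np.abs(np.concatenate(coeffs[1:]))
    # Initialize mixing weights and scale parameters of the zero-mean Laplacian components
    alpha = np.full(n_components, 1.0 / n_components)
    b = np.quantile(w, np.linspace(0.3, 0.9, n_components))
    for _ in range(n_iter):
        # E-step: responsibility of each Laplacian component for each coefficient
        dens = alpha / (2.0 * b) * np.exp(-w[:, None] / b)
        resp = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
        # M-step: update weights and scales (weighted mean absolute coefficient value)
        alpha = resp.mean(axis=0)
        b = (resp * w[:, None]).sum(axis=0) / resp.sum(axis=0)
    # The fitted mixture parameters serve as the low-dimensional audio feature vector
    return np.concatenate([alpha, b])

Because only a handful of mixture parameters are retained per frame, the resulting index is compact while still reflecting the peaky, heavy-tailed shape of the wavelet coefficient distribution.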