1.3.5.2
Adaptive Video Retrieval with Human in the Loop
While many RF models have been successfully developed for still-image applications, they have not yet been widely implemented for video databases. This is because effective content analysis through RF learning must capture not only the spatial information required for a single image, but also the temporal information in a video. The fundamental requirement is a representation that allows RF processes to capture the sequential information in a video file.
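The sketch below illustrates one way such a sequential representation could be built: a template-frequency style histogram that counts how often each visual "template" occurs across the frames of a video. This is only a minimal illustration under assumed inputs (precomputed frame-level features and a codebook of templates); the actual TFM construction is developed in the later chapters, and the function and variable names here are not from the book.

```python
# Minimal sketch of a template-frequency style video signature (illustrative
# only; assumes frame-level feature vectors and a pre-built template codebook).
import numpy as np

def build_video_signature(frame_features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame to its nearest template and count occurrences.

    frame_features: (n_frames, d) feature vectors extracted per frame
    codebook:       (n_templates, d) representative templates (e.g., k-means centroids)
    Returns an n_templates-dimensional histogram, normalised so signatures of
    videos with different lengths remain comparable.
    """
    # Squared Euclidean distance from every frame to every template
    d2 = ((frame_features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                        # template index per frame
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)                 # template "frequencies"
```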
In this topic, the efficiency of TFM-based video indexing in capturing user perception is demonstrated for video retrieval in both the RF process and the semi-automatic process. The technology developments along this line are presented in Chaps. 3, 7, 8, and 10.
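To make the human-in-the-loop interaction concrete, the following sketch shows one round of relevance feedback over video signatures, using a simple Rocchio-style query update as a stand-in for the RF learning models developed in those chapters. The weights alpha, beta, and gamma and all function names are hypothetical choices for illustration.

```python
# Illustrative human-in-the-loop RF round over video signatures. The Rocchio-style
# update is a placeholder for the book's RF models; parameters are hypothetical.
import numpy as np

def rf_round(query, signatures, relevant_ids, nonrelevant_ids,
             alpha=1.0, beta=0.75, gamma=0.15, top_k=10):
    """Refine the query from user feedback and re-rank the database.

    query:      (d,) current query signature
    signatures: (n_videos, d) database of video signatures
    relevant_ids / nonrelevant_ids: lists of indices the user marked
    """
    q = alpha * query
    if relevant_ids:
        q = q + beta * signatures[relevant_ids].mean(axis=0)    # move toward relevant videos
    if nonrelevant_ids:
        q = q - gamma * signatures[nonrelevant_ids].mean(axis=0)  # move away from non-relevant
    # Cosine similarity between the refined query and every video signature
    sims = signatures @ q / (np.linalg.norm(signatures, axis=1) * np.linalg.norm(q) + 1e-12)
    ranking = np.argsort(-sims)[:top_k]
    return q, ranking
```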
1.3.5.3
Video Retrieval with Pseudo-Relevance Feedback
There is a difficulty in the practical application of RF learning to video databases. Compared to images, the interactive retrieval of video samples can be a time-consuming task, since video files are usually large. The user has to play a sufficient number of videos to train the retrieval process to make a better judgment of relevance. Furthermore, on a network-based database, this RF learning process requires high-bandwidth transmission during the user interaction.
The video retrieval strategy presented in this topic combines the TFM-based video indexing structure with pseudo-RF to overcome the above challenges. The integration of TFM with an (unsupervised) adaptive cosine network is presented to adaptively capture different degrees of visual importance in a video sequence. This network structure implements a pseudo-RF process through its signal propagation, with no user input, to achieve higher retrieval accuracy. Hence, this technique avoids the time-consuming task of user interaction and allows video retrieval to be implemented suitably on network-based databases. This pseudo-RF method is presented in Chaps. 3 and 8.
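The following is a minimal sketch of the pseudo-RF idea: the top results of an initial cosine ranking are treated as relevant without any user input, and their similarities are propagated back to the database in a second pass. This only approximates, in spirit, the signal propagation of the adaptive cosine network, whose actual formulation is given in Chaps. 3 and 8; all names and parameters below are illustrative.

```python
# Pseudo-RF sketch: top-ranked results are assumed relevant (no user input)
# and used to re-rank the database in a second pass.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def pseudo_rf_rank(query, signatures, n_pseudo=5, mix=0.5, top_k=10):
    # Pass 1: rank every video against the query
    s1 = np.array([cosine(query, v) for v in signatures])
    pseudo_relevant = np.argsort(-s1)[:n_pseudo]          # assumed relevant, no user labels
    # Pass 2: propagate similarity from the pseudo-relevant set back to the database
    s2 = np.array([np.mean([cosine(signatures[p], v) for p in pseudo_relevant])
                   for v in signatures])
    final = (1.0 - mix) * s1 + mix * s2                    # blend the two passes
    return np.argsort(-final)[:top_k]
```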
1.3.5.4
Multi-Modal Fusion
Tasks involving the analysis of video content, such as detection of complex events,
are intrinsically multimodal problems, since audio, textual, and visual information
all provide important clues to identify content. The fusion of these modalities offers a more complete description of the video and hence facilitates effective video retrieval.
Chapter 7 explores the multiple modalities in video with the MPEG-7 standard
descriptors. Video segmentation is performed by characterizing events with motion
activity descriptors. Then, the events are classified by multimodal analysis using motion, audio, and Mel-Frequency Cepstral Coefficient (MFCC) features.
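As a simple illustration of how such per-modality evidence can be combined, the sketch below performs score-level (late) fusion of motion-based and MFCC-based classifier outputs. It assumes the per-modality classifiers have already been trained; the weighting scheme and function names are illustrative and not the specific multimodal analysis of Chap. 7.

```python
# Score-level (late) fusion sketch for event classification, assuming
# per-modality classifiers already produce per-class scores.
import numpy as np

def fuse_scores(motion_scores: np.ndarray, audio_scores: np.ndarray,
                w_motion: float = 0.5, w_audio: float = 0.5) -> np.ndarray:
    """Combine per-class scores from two modalities into one decision.

    motion_scores, audio_scores: (n_events, n_classes) posterior-like scores
    Returns the predicted class index for each event after weighted fusion.
    """
    # Normalise each modality so neither dominates purely by scale
    m = motion_scores / (motion_scores.sum(axis=1, keepdims=True) + 1e-12)
    a = audio_scores / (audio_scores.sum(axis=1, keepdims=True) + 1e-12)
    fused = w_motion * m + w_audio * a
    return fused.argmax(axis=1)
```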
Chapter 10 presents an audio-visual fusion that combines TFM visual features with Laplacian mixture model (LMM) audio features. The multimodal signals