copy detection using high-level descriptions derived from the BoW representation [239]; and person spotting and retrieval based on facial features in videos [240]. In the field of video event analysis, Zhou et al. applied the BoW model to Gaussian mixture models to represent news videos and utilized kernel-based supervised learning to classify news events [241]. The BoW model was also used to represent video clips in Xu and Chang's work on video event recognition, where a multilevel temporal pyramid was adopted to integrate information from different sub-clips for pyramid matching with temporal alignment [242].
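As a rough illustrative sketch of the general temporal-pyramid idea (not the exact method of [242]), a clip can be split into progressively finer sub-clips, a BoW histogram computed for each sub-clip, and the histograms concatenated. The number of pyramid levels and the per-segment normalization below are illustrative assumptions.

import numpy as np

def temporal_pyramid_bow(frame_words, vocab_size, levels=3):
    """Concatenate per-segment BoW histograms over a multilevel temporal pyramid.

    frame_words: list of 1-D arrays, each holding the visual-word indices of one frame.
    """
    n_frames = len(frame_words)
    pyramid = []
    for level in range(levels):
        n_segments = 2 ** level                       # 1, 2, 4, ... sub-clips per level
        bounds = np.linspace(0, n_frames, n_segments + 1, dtype=int)
        for start, end in zip(bounds[:-1], bounds[1:]):
            seg = frame_words[start:end]
            words = np.concatenate(seg) if seg else np.array([], dtype=int)
            hist = np.bincount(words.astype(int), minlength=vocab_size).astype(float)
            pyramid.append(hist / max(hist.sum(), 1.0))  # L1-normalize each segment
    return np.concatenate(pyramid)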
The aforementioned video analysis methods using BoW models have their individual merits. However, there is a lack of systematic investigation connecting the individual aspects of video analysis, from genre categorization of raw input video clips, to mid-level semantic view or shot understanding, and finally to high-level semantic event analysis. Furthermore, large-scale video data often spans many hours and contains much insignificant information, so it calls for automatic and orderly analysis to achieve efficient information extraction. In this chapter, we propose a BoW model to represent video frames and clips, together with an unsupervised learning approach that exploits this BoW-based video representation. With these, we tackle a series of video analysis challenges on unlabeled large-scale video collections, achieving a systematic analysis of the video data.
To evaluate the effectiveness of the BoW model in systematic video analysis, we need a valid and meaningful test ground, and we believe that large-scale sports video is ideal. First, sports video truly forms a large-scale collection and contributes significantly to the total volume of digital content. Second, the sources of sports video are varied: from everyday public recreation to professional sports broadcasts, from amateur digital camcorders to professional TV production, as well as plentiful but low-quality online streamed videos. Third, sports video analysis is closely connected with real applications, owing to its huge popularity and vast commercial value.
Although the analysis of sports video has drawn much attention in the research community, most of the literature focuses on particular sports and tasks, utilizing domain knowledge and production rules [243-247]. Supervised learning is an important technique adopted by these works to bridge the semantic gap. These stand-alone methods have little interconnection and also lack generality and scalability to large-scale data, for two reasons. First, given the variety of video content with different themes and cinematographic techniques, methods tied to domain knowledge are difficult to extend. Second, supervised learning requires labeled data, whereas the majority of available multimedia data is currently unlabeled. To tackle these two issues, our proposed algorithm uses local, domain-knowledge-independent SIFT features to represent video clips with the BoW model, and adopts an unsupervised learning paradigm to handle large volumes of unlabeled data.
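A minimal sketch of this kind of pipeline is given below, assuming OpenCV's SIFT implementation and k-means clustering; the vocabulary size, frame sampling, and cluster count are illustrative assumptions rather than the settings used in this chapter.

import cv2
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

def clip_sift_descriptors(frames):
    """Collect 128-D SIFT descriptors from a list of sampled BGR frames."""
    sift = cv2.SIFT_create()
    descriptors = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, desc = sift.detectAndCompute(gray, None)
        if desc is not None:
            descriptors.append(desc)
    return np.vstack(descriptors) if descriptors else np.empty((0, 128))

def build_vocabulary(descriptor_pool, vocab_size=500):
    """Quantize pooled descriptors into visual words (vocabulary size is an assumption)."""
    return MiniBatchKMeans(n_clusters=vocab_size, n_init=3).fit(descriptor_pool)

def bow_histogram(descriptors, vocabulary):
    """L1-normalized histogram of visual-word occurrences for one clip."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def cluster_clips(clip_histograms, n_groups=10):
    """Unsupervised grouping of unlabeled clips by their BoW vectors."""
    return KMeans(n_clusters=n_groups, n_init=10).fit_predict(np.asarray(clip_histograms))

In practice, the descriptor pool used to build the vocabulary would be sampled from clips across the whole collection, so that the visual words are not biased toward any single genre.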
In this chapter, a generic and systematic framework is proposed, with experiments on a large-scale sports video dataset. Three tasks are introduced such that the output of each task is utilized as the input to the next.