The representation built from the BoW model allows visual features to be extracted homogeneously and videos to be represented in a compact, collective form.
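As a concrete illustration of this representation, the following is a minimal Python sketch, assuming local descriptors (e.g., SIFT-like vectors) have already been extracted from the frames; the function names and the choice of k-means from scikit-learn are illustrative assumptions, not the exact pipeline of the original system.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=500, seed=0):
    """Cluster pooled local descriptors into k visual words (the BoW codebook)."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(descriptors)

def bow_histogram(codebook, video_descriptors):
    """Quantize one video's local descriptors against the codebook and pool
    them into a normalized histogram -- the compact, collective video form."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Because every video is mapped through the same codebook, all videos share one fixed-length feature space regardless of their duration or content.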
In the first module, videos are categorized by genre. Video genre nomenclature describes the video type, which is defined as the coarsest level of granularity in video content representation. Since genre categorization relies directly on low-level features, the proposed feature extraction of the target video sequence is used for categorization. In large-scale video collections, successfully identifying the genre serves as the first step before attempting higher-level tasks. For instance, in sports event detection, an unknown “shooting” event may be the target query, and it could come from a ball game or a shooting sport. If the entire dataset is treated indiscriminately, this event must be searched across all types of sports. However, since sports such as figure skating and swimming contain no “shooting” at all, searching for this event within those irrelevant sports is wasted effort. Instead of treating all data alike, a more efficient strategy is to identify the genre of the query video first and then deploy middle- and high-level tasks. As the survey of sports video analysis shows, most related work on view classification and event detection assumes the genre is known by default. This framework, by contrast, provides a system that automatically identifies the genre across various types of sports data before further analysis.
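A minimal sketch of this first-stage genre filter, assuming per-video BoW histograms as input; the linear SVM and the label names are illustrative placeholders, not the exact classifier used in the chapter.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_genre_classifier(histograms, genres):
    """Train a genre classifier on per-video BoW histograms.
    `genres` are string labels such as "soccer" or "swimming" (hypothetical)."""
    return LinearSVC(C=1.0).fit(np.asarray(histograms), genres)

def identify_genre(clf, bow_hist):
    """First-stage filter: a "shooting" query is then searched only within
    genres where such an event can plausibly occur."""
    return clf.predict([bow_hist])[0]
```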
In the second, middle-level module, semantic view types are classified using an unsupervised PLSA learning method, which provides labels for video frames. A view describes an individual video frame by abstracting its overall content and is treated as a bridge between low-level visual features and high-level semantic understanding. In addition, unsupervised learning saves a massive amount of human effort when processing large-scale data. Supervised methods can also be implemented on the proposed platform; therefore, an SVM model is run as the baseline for comparison.
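The following is a minimal sketch of PLSA inference by EM over frame/visual-word count data, with the latent topics standing in for view types; the iteration count, smoothing constant, and variable names are assumptions for illustration.

```python
import numpy as np

def plsa_view_labels(counts, n_views, n_iter=50, seed=0):
    """Unsupervised PLSA: frames act as 'documents', visual words as 'words',
    and latent topics play the role of view types.
    counts: (n_frames, n_words) matrix of visual-word counts per frame."""
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_z_d = rng.random((D, n_views)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_views, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w), shape (D, W, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / (joint.sum(2, keepdims=True) + 1e-12)
        weighted = counts[:, :, None] * resp          # n(d, w) * P(z | d, w)
        # M-step: re-estimate P(w | z) and P(z | d)
        p_w_z = weighted.sum(0).T
        p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(1)
        p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_z_d.argmax(1)   # pseudo-label: most likely view type per frame
```

For the supervised baseline, an SVM (e.g., sklearn.svm.SVC) could analogously be trained on the same per-frame histograms using manually labeled views.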
Finally, in the third module, a structured-prediction HCRF model operating on the labeled inputs is a natural fit for detecting semantic events. This choice is justified by the fact that a video event occupies a variable length along the temporal dimension, so the state-based event model of the HCRF is well suited here. Less comprehensive baseline methods, such as the hidden Markov model and the conditional random field, can also be applied on this platform.
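Full HCRF training involves latent-variable gradient machinery beyond a short sketch, so the example below instead illustrates the simpler HMM baseline named above: a Viterbi decoder mapping the per-frame view labels from the second module to a most-likely sequence of hidden event states. All parameters here are illustrative assumptions.

```python
import numpy as np

def viterbi(obs, log_start, log_trans, log_emit):
    """Most likely hidden event-state sequence for a sequence of per-frame
    view labels. log_trans[i, j] = log P(state j at t | state i at t-1);
    log_emit[s, v] = log P(view label v | state s)."""
    T, S = len(obs), log_start.shape[0]
    delta = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans    # (S, S) transition scores
        back[t] = scores.argmax(0)
        delta[t] = scores.max(0) + log_emit[:, obs[t]]
    path = [int(delta[-1].argmax())]                  # backtrack best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The HCRF extends this picture by keeping the states hidden during discriminative training, which lets one event label summarize a variable-length span of frames.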
Besides the three-level modules shown in the white-background bounding boxes, this framework, illustrated in Fig. 9.1, also highlights the relationship between our system and the existing literature, shown in the dark-gray-background bounding box. Associated table references are indicated in each module. Multimodal features other than local visual features are also introduced at various stages by the literature; dotted arrows represent these associations. The solid arrows denote the techniques proposed and implemented in our work. The dashed arrow represents a knowledge-transfer characteristic of the generated codebooks. In summary, codebooks generated from certain sports with abundant resources can be transferred and used to classify material from other sports with scarce resources.
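A minimal sketch of this transfer idea, assuming local descriptors are available for both sports; the sport names, codebook size, and helper structure are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def transfer_codebook_histogram(rich_descriptors, scarce_descriptors, k=500):
    """Train the codebook on a resource-rich sport (e.g., soccer), then reuse
    it to quantize and represent a video from a resource-scarce sport."""
    codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(rich_descriptors)
    words = codebook.predict(scarce_descriptors)      # reuse, no retraining
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Because the codebook only encodes generic local appearance, nothing sport-specific needs to be relearned for the scarce-resource material.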