as feature descriptors in this work. The SIFT method extracts key-points of an
image and describes these points using local neighborhood regional information.
Since no prior or domain knowledge is required, SIFT is well suited to a
large-scale, automatic, and homogeneous process. When image sequences
sampled from video clips are processed, each frame is represented by hundreds of
SIFT descriptors. After homogeneous local descriptor extraction, the BoW model is
applied, whose effectiveness relies on a robust codebook design. To achieve
this robustness, we propose a two-level bottom-up K-means clustering for codebook
generation. The advantages of the bottom-up structure are efficiency, scalability, and
robustness.
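The passage above does not spell out the two-level procedure, so the following is only a minimal sketch of one plausible reading: K-means is first run separately on each subset of descriptors, and a second K-means then clusters the pooled first-level centers into the final codebook. The partitioning into subsets, the function names `kmeans` and `two_level_codebook`, the deterministic initialization, and the parameters `k_local` and `k_global` are all assumptions made for illustration.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Point], k: int, iterations: int = 20) -> List[List[float]]:
    """Plain Lloyd's K-means with a deterministic, evenly spaced initialization."""
    if k < 1 or len(points) < k:
        raise ValueError("need at least k points and k >= 1")
    step = len(points) / k
    centers = [list(points[int(i * step)]) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def two_level_codebook(
    partitions: List[List[Point]], k_local: int, k_global: int
) -> List[List[float]]:
    """Level 1: cluster each partition alone. Level 2: cluster the pooled centers."""
    local_centers: List[List[float]] = []
    for part in partitions:
        local_centers.extend(kmeans(part, k_local))
    return kmeans(local_centers, k_global)
```

In this reading, each first-level run only sees one subset of the descriptors, and the second-level run only sees the much smaller set of first-level centers, which is where the efficiency and scalability would come from.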
The BoW model is adopted by first synthesizing a representative codebook whose
codewords are exemplars obtained by combining sampled SIFT local descriptors.
A video clip is then characterized by mapping its SIFT feature points to
the generated codebook, which yields a histogram of codeword frequencies. Compared to
the original footage, this compact representation preserves enough information for
differentiation while requiring little storage. In addition, random noise can be
suppressed by using this frequency-based histogram representation.
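The mapping from descriptors to a histogram can be sketched as follows. This is a minimal illustration, assuming nearest-codeword (hard) assignment under Euclidean distance and normalization by the number of descriptors; the function names `nearest_codeword` and `bow_histogram` are hypothetical and not taken from the text.

```python
from typing import List, Sequence

Descriptor = Sequence[float]


def sq_dist(a: Descriptor, b: Descriptor) -> float:
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(descriptor: Descriptor, codebook: List[Descriptor]) -> int:
    """Index of the codeword closest to the descriptor."""
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))


def bow_histogram(
    descriptors: List[Descriptor], codebook: List[Descriptor]
) -> List[float]:
    """Relative frequency of each codeword over all descriptors of a clip."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)  # no descriptors: empty histogram
    return [c / total for c in counts]
```

The histogram has one entry per codeword regardless of how many frames or descriptors the clip contains, which is why the representation stays compact.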
With a large-scale dataset, the efficiency and robustness of codebook formation
are important concerns for the BoW model. Heuristically, the larger the
codebook size, the better the classification results (up to a saturation
point) [260, 261]. Different codebook sizes have been explored, ranging from several
hundred [262, 263] to thousands [264] to hundreds of thousands [260]. Since these studies
all use different datasets, no standard rule for choosing the size has emerged.
In this chapter, codebook sizes are chosen based on empirical studies.
K-means clustering is used to generate a codebook by finding cluster centers and
assigning them as codewords. In a large-scale domain, satisfactory performance
has been reported using a top-down structure for categorization [265]. In
that work, a two-layer top-down structure is used for sports genre categorization.
At the first layer, a general codebook (size 800) is generated using a single K-means run,
and a query video is categorized only into one of several predefined larger groups,
each consisting of several genres. Each group comprises sports that share
similar semantics. At the second layer, after membership in the larger group
is identified, an individual codebook (size 200) for that group is used to
decide the video genre. For instance, judo and boxing are combined into a larger
group named martial arts, which serves as the first-layer candidate.
Subsequently, judo and boxing are differentiated in the second-layer categorization.
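The routing logic of such a two-layer top-down scheme can be sketched as below. The text does not say which classifier [265] uses, so a nearest-prototype rule on BoW histograms stands in here purely as an assumption; the names `categorize_top_down`, `nearest_label`, and all data structures are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bow_histogram(descriptors: List[Vector], codebook: List[Vector]) -> List[float]:
    """Normalized codeword frequencies using nearest-codeword assignment."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda k: sq_dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def nearest_label(histogram: Vector, prototypes: Dict[str, Vector]) -> str:
    """Assumed stand-in classifier: label of the closest reference histogram."""
    return min(prototypes, key=lambda label: sq_dist(histogram, prototypes[label]))


def categorize_top_down(
    descriptors: List[Vector],
    general_codebook: List[Vector],
    group_prototypes: Dict[str, Vector],
    group_codebooks: Dict[str, List[Vector]],
    genre_prototypes: Dict[str, Dict[str, Vector]],
) -> Tuple[str, str]:
    """Layer 1 picks the larger group; layer 2 picks the genre inside that group."""
    # First layer: histogram over the general codebook decides the group.
    group = nearest_label(bow_histogram(descriptors, general_codebook), group_prototypes)
    # Second layer: histogram over that group's own codebook decides the genre.
    genre_hist = bow_histogram(descriptors, group_codebooks[group])
    genre = nearest_label(genre_hist, genre_prototypes[group])
    return group, genre
```

Note that a wrong decision at the first layer cannot be corrected at the second, since only the selected group's codebook and genres are consulted afterwards.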
Although good classification accuracy has been reported, efficiency and robustness
are problems for such a method because it creates the general codebook with a
single K-means clustering. Most of the computation in K-means lies
in calculating the distances between individual points and the cluster centers in
each iteration. A single K-means clustering on large-scale data is computationally
heavy and sometimes inaccurate due to K-means' own limitations. Since more
than 3 million high-dimensional SIFT points are used for building the codebook in
our application, a single K-means clustering becomes inefficient.
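To give a rough sense of scale, the per-iteration cost of the assignment step can be estimated with a few lines of arithmetic. The sketch below assumes standard Lloyd-style K-means, where every point is compared with every center in each iteration. The codebook size of 800 is borrowed from the first layer of [265] only as an illustrative value, 128 is the usual SIFT descriptor length, and 3 million is the lower bound stated above; the function names are hypothetical.

```python
def distance_evaluations(num_points: int, num_codewords: int, iterations: int = 1) -> int:
    """Point-to-center distance computations: one per (point, center) pair per iteration."""
    return num_points * num_codewords * iterations


def multiply_adds(
    num_points: int, num_codewords: int, dimension: int, iterations: int = 1
) -> int:
    """Each distance costs roughly one multiply-add per descriptor dimension."""
    return distance_evaluations(num_points, num_codewords, iterations) * dimension


# Illustrative figures only (see the assumptions stated above).
per_iteration = distance_evaluations(3_000_000, 800)
per_iteration_ops = multiply_adds(3_000_000, 800, 128)
print(per_iteration)      # 2400000000
print(per_iteration_ops)  # 307200000000
```

Under these assumptions a single iteration already requires about 2.4 billion distance computations, and K-means typically needs many iterations to converge.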