Scalable Video Genre Classification and Event Detection - Multimedia Database Retrieval: Technology and Applications

as feature descriptors in this work. The SIFT method extracts key-points of an

image and describes these points using local neighborhood regional information.

Since no prior or domain knowledge is required, SIFT is well suited to a

large-scale, automatic, and homogeneous process. By processing image sequences

sampled from video clips, each frame yields on the order of hundreds of

SIFT descriptors. After homogeneous local descriptor extraction, the BoW model is

applied, whose effectiveness relies on a robust codebook design. To achieve

this robustness, we propose a two-level bottom-up K-means clustering for codebook

generation. The advantages of the bottom-up structure are efficiency, scalability, and

robustness.
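
The passage names the two-level bottom-up scheme without spelling out its levels, so the following is only a sketch under an assumed reading: the first level runs K-means separately on each subset of descriptors (for example, one subset per video or per genre), and the second level runs K-means again on the pooled first-level centers to produce the final codewords. The function names `kmeans` and `bottom_up_codebook`, the parameters `k_local` and `k_final`, and the toy 2-D data are all hypothetical; real SIFT descriptors are 128-dimensional.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style K-means; returns a list of k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from k distinct input points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers


def bottom_up_codebook(subsets, k_local, k_final):
    """Assumed two-level scheme: cluster each subset, then cluster the centers."""
    local_centers = []
    for subset in subsets:
        local_centers.extend(kmeans(subset, k_local))  # level 1: per subset
    return kmeans(local_centers, k_final)              # level 2: pooled centers


# Toy data: three subsets of 40 random 2-D "descriptors" each.
rng = random.Random(1)
subsets = [[(rng.random(), rng.random()) for _ in range(40)] for _ in range(3)]
codebook = bottom_up_codebook(subsets, k_local=4, k_final=3)
```

Under this reading, each first-level run sees only a fraction of the data and the second-level run sees only the much smaller set of first-level centers, which is one plausible source of the efficiency and scalability claimed above.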

The BoW model is adopted by first building a representative codebook whose

codewords are exemplars formed by combining sampled SIFT local descriptors.

A video clip is then characterized by mapping its SIFT feature points to

the generated codebook, which yields a histogram distribution. Compared to

the original footage, this compact representation preserves enough information for

differentiation while requiring little storage. In addition, random noise can be

suppressed by this frequency-based histogram representation.
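
The mapping from descriptors to a histogram can be sketched as follows. This is a minimal illustration, assuming hard assignment of each descriptor to its nearest codeword under Euclidean distance and normalization of the counts to frequencies; the text does not state either choice. The names `nearest_codeword` and `bow_histogram` and the 2-D toy vectors are hypothetical.

```python
import math


def nearest_codeword(descriptor, codebook):
    """Index of the codeword closest to the descriptor (squared Euclidean)."""
    best_index, best_dist = 0, math.inf
    for index, codeword in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(descriptor, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bow_histogram(descriptors, codebook):
    """Normalized codeword-frequency histogram for one video clip."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    # Assumed normalization: divide by the number of descriptors in the clip.
    return [c / total for c in counts] if total else counts


# Toy codebook of three codewords and four descriptors from one clip.
codebook = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
clip_descriptors = [(0.5, 0.2), (9.0, 11.0), (0.1, -0.3), (1.0, 9.5)]
histogram = bow_histogram(clip_descriptors, codebook)
print(histogram)  # [0.5, 0.25, 0.25]
```

The histogram length equals the codebook size regardless of how many frames or descriptors the clip contains, which is what makes the representation compact.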

With a large-scale dataset, the efficiency and robustness of codebook formation

are important concerns for the BoW model. Heuristically, the larger the

codebook size, the better the classification results, up to a saturation

point [260, 261]. Different codebook sizes have been explored, ranging from several

hundred [262, 263] to thousands [264] to hundreds of thousands [260]. Since these

studies all use different datasets, no standard rule has emerged.

In this chapter, codebook sizes are chosen based on empirical studies.

K-means clustering is used to generate a codebook by finding cluster centers

and taking them as codeword values. In a large-scale domain, satisfactory

performance has been reported using a top-down structure for categorization [265]. In

that work, a two-layer top-down structure is used for sports genre categorization.

At the first layer, a general codebook (size 800) is generated using a single K-means

run, and a query video is categorized only into one of the predefined bigger groups,

each consisting of several genres. Such a group is formed from sports sharing

similar semantics. At the second layer, after the bigger group

is identified, an individual codebook (size 200) for this bigger group is used to

decide the video genre. For instance, judo and boxing are combined into a bigger

group named martial arts, which serves as the first-layer candidate.

Subsequently, judo and boxing are differentiated in the second-layer categorization.
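
The two-layer top-down decision described for [265] can be sketched roughly as below. The classifier used in that work is not given in this passage, so a nearest-prototype rule on histograms stands in for it purely as an assumption. The function names, the prototype values, and the "ball games", "soccer", and "tennis" labels are hypothetical; the toy histograms have 2 bins where the cited work uses codebooks of size 800 and 200.

```python
def nearest_label(histogram, prototypes):
    """Label whose prototype histogram is closest (squared Euclidean)."""
    return min(
        prototypes,
        key=lambda label: sum(
            (a - b) ** 2 for a, b in zip(histogram, prototypes[label])
        ),
    )


def classify_top_down(general_hist, group_hists, group_prototypes, genre_prototypes):
    """First layer picks the bigger group; second layer picks the genre in it.

    general_hist: query histogram over the general (first-layer) codebook.
    group_hists:  query histograms over each group's own codebook.
    """
    group = nearest_label(general_hist, group_prototypes)
    genre = nearest_label(group_hists[group], genre_prototypes[group])
    return group, genre


# Hypothetical prototypes for two groups and their member genres.
group_prototypes = {"martial arts": [0.7, 0.3], "ball games": [0.2, 0.8]}
genre_prototypes = {
    "martial arts": {"judo": [0.9, 0.1], "boxing": [0.1, 0.9]},
    "ball games": {"soccer": [0.6, 0.4], "tennis": [0.3, 0.7]},
}

# Query clip: one histogram per codebook it may be matched against.
query_general = [0.6, 0.4]
query_group_hists = {"martial arts": [0.2, 0.8], "ball games": [0.5, 0.5]}

result = classify_top_down(
    query_general, query_group_hists, group_prototypes, genre_prototypes
)
print(result)  # ('martial arts', 'boxing')
```

Note that the second-layer decision only ever compares genres inside the group chosen at the first layer, so a first-layer mistake cannot be corrected later.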

Although good classification accuracy has been reported, efficiency and robustness

are problems for such a method because the general codebook is created using a

single K-means clustering. Most of the computation in K-means lies

in calculating the distances between individual points and the cluster centers in

each iteration. A single K-means clustering over large-scale data is computationally

heavy and sometimes inaccurate due to K-means' own limitations. Since more

than 3 million high-dimensional SIFT points are used for building the codebook in

our application, a single K-means clustering becomes inefficient.
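
To give a rough sense of scale, the per-iteration cost of the distance step can be estimated as the number of points times the number of cluster centers, times the descriptor dimension for the arithmetic inside each distance. The sketch below assumes the standard 128-dimensional SIFT descriptor and borrows the codebook size of 800 quoted above for [265] purely as an illustrative value; the codebook size used in this chapter is not stated in this passage. The function name `kmeans_iteration_cost` is hypothetical.

```python
def kmeans_iteration_cost(num_points, num_clusters, dim):
    """Return (distance evaluations, multiply-adds) for one K-means iteration."""
    distances = num_points * num_clusters  # every point against every center
    multiply_adds = distances * dim        # one multiply-add per dimension
    return distances, multiply_adds


# 3 million SIFT points (from the text), 800 centers and 128 dims (assumed).
distances, multiply_adds = kmeans_iteration_cost(3_000_000, 800, 128)
print(f"{distances:,} distance evaluations per iteration")  # 2,400,000,000
print(f"{multiply_adds:,} multiply-adds per iteration")     # 307,200,000,000
```

Because this cost is paid again in every iteration, and K-means typically needs many iterations to converge, the total grows quickly with both the dataset and the codebook size.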
