Scalable Video Genre Classification and Event Detection - Multimedia Database Retrieval: Technology and Applications


extraction methods, Kolekar and Palaniappan [ 285 ] took a top-down approach. They

first used audio features to find exciting video clips. The motion features of the whole

image volume along with the background color information are then utilized for

view-type classification. Benmokhtar et al. [ 286 ] performed feature-level

fusion using dynamic PCA with an information-coding neural network (NN). At the

classification level, another NN is used to fuse multi-modality inputs. However,

these supervised methods are limited by the availability of labeled data and are

therefore difficult to extend to larger scales.

Some other researchers pursued unsupervised methods for view classification.

Wang et al. [ 280 ] proposed an information-theoretic co-clustering method, in

which mutual information was maximized by treating shot classes and features as

two random variables. Color histogram and perceived motion

energy features were used, with a test set of four sports video genres. Zhong et al.'s

method was inspired by spectral theory, conventionally used to solve segmentation

problems in graph theory [ 281 ]. They proposed a spectral-division algorithm to find

a suitable video shot clustering, which was tested on three sports videos using

the HSV color-space feature. Although these methods achieve good performance,

their extensibility and flexibility towards diverse genres and large-scale

datasets are very limited. This limitation is again due to the dependence of the

extracted features on domain knowledge.
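
Information-theoretic co-clustering scores a grouping by the mutual information it preserves between the two random variables (here, shot classes and features). The sketch below computes only that quantity from a joint count table; it is not Wang et al.'s algorithm, and the function name `mutual_information` and the toy tables are hypothetical illustrations.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of a 2-D joint count or probability table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                 # normalise to a joint distribution
    px = joint.sum(axis=1, keepdims=True)       # marginal over rows (e.g. shot classes)
    py = joint.sum(axis=0, keepdims=True)       # marginal over columns (e.g. features)
    mask = joint > 0                            # skip zero cells: 0 * log 0 is taken as 0
    ratio = joint[mask] / (px @ py)[mask]
    return float((joint[mask] * np.log2(ratio)).sum())

# Toy tables (hypothetical): independent variables vs. perfectly coupled ones.
mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
mi_coupled = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

A co-clustering method of this kind would search over row and column groupings for the one that keeps this value as high as possible after the table is collapsed onto the clusters.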

Table 9.3 compares the aforementioned methodologies in terms of feature

utilization and classification techniques. Color and texture are two major global

features used by most works. Duan et al.'s work is the only one that proposed

middle-level features developed from low-level global features. The remaining works

either adopted additional popular global features, such as audio or Gabor features,

along with some production-rule-based features, or did not use any.

While various global features are used, no local features have been applied.

Moreover, most of the supervised methods (except Duan et al.'s work) focus on a

single sport (soccer), while the unsupervised techniques cover several types of sports.

9.3.2.2 Unsupervised View Classification

This section introduces the middle-level view classification, where the previously

built BoW model is also used as the feature representation. Since this work targets

large-scale videos, an unsupervised solution is more viable and applicable. Therefore,

we chose to use unsupervised probabilistic latent semantic analysis (PLSA)-based

models. PLSA has demonstrated promising results in analyzing co-occurrence data

of words and documents in text retrieval [ 287 ]. From a matrix factorization point of

view, PLSA belongs to a subgroup called non-negative matrix factorization, where

the factorized matrices are non-negative [ 288 ]. Because the codebook paradigm

with codewords is adopted in mapping visual features to a probability-based

histogram, which must be non-negative, PLSA is a more suitable choice than

other factorization techniques, such as singular value decomposition or

principal component analysis, whose factors may contain negative entries.
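
To make the factorization view concrete, the following is a minimal sketch of standard PLSA fitted with EM on a small count matrix of codeword histograms. It is an illustration under stated assumptions, not the chapter's implementation: the function name `plsa`, its parameters, and the toy histogram are hypothetical, and rows stand in for video shots while columns stand in for codewords.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSA to a (documents x words) non-negative count matrix with EM.

    Returns P(z|d) with shape (docs, topics) and P(w|z) with shape (topics, words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random non-negative initialisation; each row is normalised to sum to 1.
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    eps = 1e-12  # guards against division by zero
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z|d) * P(w|z); shape (docs, topics, words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate both factors from expected counts n(d,w) * P(z|d,w).
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z

# Toy codeword histogram (hypothetical): 4 shots x 6 codewords.
counts = np.array([
    [5, 4, 6, 0, 0, 1],
    [4, 6, 5, 1, 0, 0],
    [0, 1, 0, 5, 6, 4],
    [1, 0, 0, 6, 4, 5],
], dtype=float)
p_z_d, p_w_z = plsa(counts, n_topics=2)
```

Both returned factors are non-negative and row-normalised, so their product `p_z_d @ p_w_z` is itself a valid word distribution per shot. This is the property that matches the probability-based histogram, and one that an unconstrained SVD or PCA factorization does not guarantee.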
