The representation built from the BoW model allows visual features to be extracted homogeneously and videos to be represented in a compact, collective form.
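As a concrete illustration of this representation, the following is a minimal Python sketch, assuming local descriptors (e.g., SIFT-like vectors) have already been extracted from the frames; the function names and the choice of k-means from scikit-learn are illustrative assumptions, not the exact pipeline of the original system.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=500, seed=0):
    """Cluster pooled local descriptors into k visual words (the BoW codebook)."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(descriptors)

def bow_histogram(codebook, video_descriptors):
    """Quantize one video's local descriptors against the codebook and pool
    them into a normalized histogram -- the compact, collective video form."""
    words = codebook.predict(video_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Because every video is mapped through the same codebook, all videos share one fixed-length feature space regardless of their duration or content.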
In the first module, videos are categorized by genre. Video genre nomenclature describes the video type, which is defined as the coarsest level of granularity in video content representation. Since genre categorization relies directly on low-level features, the proposed feature extraction of the target video sequence is used for categorization. In large-scale video collections, successfully identifying the genre serves as the first step before attempting higher-level tasks. For instance, in sports event detection, an unknown “shooting” event may be the target query, and it could come from a ball game or a shooting sport. If the entire dataset is treated indiscriminately, this event must be searched across all types of sports. However, since sports such as figure skating and swimming contain no “shooting” at all, searching for this event within those irrelevant sports is wasted effort. Instead of treating all data alike, a more efficient strategy is to identify the genre of the query video first and then deploy middle- and high-level tasks. As the survey of sports video analysis shows, most related work on view classification and event detection assumes the genre is known by default. This framework, by contrast, provides a system that automatically identifies the genre across various types of sports data before further analysis.
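A minimal sketch of this first-stage genre filter, assuming per-video BoW histograms as input; the linear SVM and the label names are illustrative placeholders, not the exact classifier used in the chapter.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_genre_classifier(histograms, genres):
    """Train a genre classifier on per-video BoW histograms.
    `genres` are string labels such as "soccer" or "swimming" (hypothetical)."""
    return LinearSVC(C=1.0).fit(np.asarray(histograms), genres)

def identify_genre(clf, bow_hist):
    """First-stage filter: a "shooting" query is then searched only within
    genres where such an event can plausibly occur."""
    return clf.predict([bow_hist])[0]
```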
In the second, middle-level module, semantic view types are classified using an unsupervised PLSA learning method, which provides labels for video frames. A view describes an individual video frame by abstracting its overall content and is treated as a bridge between low-level visual features and high-level semantic understanding. In addition, unsupervised learning saves a massive amount of human effort when processing large-scale data. Supervised methods can also be implemented on the proposed platform; therefore, an SVM model is run as the baseline for comparison.
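The following is a minimal sketch of PLSA inference by EM over frame/visual-word count data, with the latent topics standing in for view types; the iteration count, smoothing constant, and variable names are assumptions for illustration.

```python
import numpy as np

def plsa_view_labels(counts, n_views, n_iter=50, seed=0):
    """Unsupervised PLSA: frames act as 'documents', visual words as 'words',
    and latent topics play the role of view types.
    counts: (n_frames, n_words) matrix of visual-word counts per frame."""
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_z_d = rng.random((D, n_views)); p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_views, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w), shape (D, W, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        resp = joint / (joint.sum(2, keepdims=True) + 1e-12)
        weighted = counts[:, :, None] * resp          # n(d, w) * P(z | d, w)
        # M-step: re-estimate P(w | z) and P(z | d)
        p_w_z = weighted.sum(0).T
        p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(1)
        p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_z_d.argmax(1)   # pseudo-label: most likely view type per frame
```

For the supervised baseline, an SVM (e.g., sklearn.svm.SVC) could analogously be trained on the same per-frame histograms using manually labeled views.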
Finally, in the third module, a structured-prediction HCRF model operating on the labeled inputs is a natural fit for detecting semantic events. This choice is justified by the fact that a video event occupies a variable length along the temporal dimension, so the state-based event model of the HCRF is well suited here. Less comprehensive baseline methods, such as the hidden Markov model and the conditional random field, can also be applied on this platform.
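Full HCRF training involves latent-variable gradient machinery beyond a short sketch, so the example below instead illustrates the simpler HMM baseline named above: a Viterbi decoder mapping the per-frame view labels from the second module to a most-likely sequence of hidden event states. All parameters here are illustrative assumptions.

```python
import numpy as np

def viterbi(obs, log_start, log_trans, log_emit):
    """Most likely hidden event-state sequence for a sequence of per-frame
    view labels. log_trans[i, j] = log P(state j at t | state i at t-1);
    log_emit[s, v] = log P(view label v | state s)."""
    T, S = len(obs), log_start.shape[0]
    delta = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans    # (S, S) transition scores
        back[t] = scores.argmax(0)
        delta[t] = scores.max(0) + log_emit[:, obs[t]]
    path = [int(delta[-1].argmax())]                  # backtrack best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The HCRF extends this picture by keeping the states hidden during discriminative training, which lets one event label summarize a variable-length span of frames.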
Besides the three-level modules shown in the white-background bounding boxes, this framework, illustrated in Fig. 9.1, also highlights the relationship between our system and the existing literature, shown in the dark-gray-background bounding box. Associated table references are indicated in each module. Multimodal features other than local visual features are also introduced at various stages by the literature; dotted arrows represent these associations. The solid arrows denote the techniques proposed and implemented in our work. The dashed arrow represents a knowledge-transfer characteristic of the generated codebooks. In summary, codebooks generated from certain sports with abundant resources can be transferred and used to classify material from other sports with scarce resources.
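A minimal sketch of this transfer idea, assuming local descriptors are available for both sports; the sport names, codebook size, and helper structure are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def transfer_codebook_histogram(rich_descriptors, scarce_descriptors, k=500):
    """Train the codebook on a resource-rich sport (e.g., soccer), then reuse
    it to quantize and represent a video from a resource-scarce sport."""
    codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(rich_descriptors)
    words = codebook.predict(scarce_descriptors)      # reuse, no retraining
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Because the codebook only encodes generic local appearance, nothing sport-specific needs to be relearned for the scarce-resource material.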