with no prior knowledge. For instance, if a key frame immediately preceding the current state is missed due to uniform sampling, such information loss can be compensated for by including and summing up distant informative frames (both previous and future) from the uniform sampling, without misclassifying the event.
Second, HCRF has merit in its hidden-state structure, which helps to relax the requirement of explicitly observed states. This relaxation property is also an advantage in dealing with large-scale, uniformly sampled video frames. Without such a configuration, a CRF model outputs an individual result label (such as event or non-event) per state and requires a separate CRF to represent each possible event [279]. In HCRF, only one final result is presented, in terms of the occurrence probabilities of multiple event classes. From the point of view of robustness, a CRF model can easily be ruined by semantically unrelated frames introduced by automatic uniform sampling. A multi-class HCRF, on the other hand, can correct the error introduced by such unrelated frames using its probability-based outputs [293].
Moreover, HCRF is also appealing because it allows the use of training data that is not explicitly labeled and has only partial structure [293]. In the literature, HCRF has been successfully applied to gesture recognition [293, 294] and phone classification [295].
Figure 9.4a illustrates an HCRF structure in which a label y ∈ Y of event type is predicted from an input X. This input consists of a sequence of vectors X = (x_1, x_2, ..., x_m, ..., x_M), with each x_m representing a local state observation along the HCRF structure. In order to predict y from a given input X, the conditional probabilistic model defined in [293] and in Eq. (9.4) is adopted. In the equation, the model parameter θ is used to describe the local potential function Φ, which is expanded in Eq. (9.6). A sequence of latent variables h = (h_1, h_2, ..., h_m, ..., h_M) is also introduced in Eq. (9.4); these variables are not observable from the structure of Fig. 9.4a. Each member h_m of h corresponds to a state s_m. The denominator Z(X; θ) is the normalization factor, which is expanded in Eq. (9.5).
\[
P(y \mid X; \theta) = \sum_{h} P(y, h \mid X; \theta) = \frac{\sum_{h} e^{\Phi(y, h, X; \theta)}}{Z(X; \theta)} \tag{9.4}
\]

\[
Z(X; \theta) = \sum_{y, h} e^{\Phi(y, h, X; \theta)} \tag{9.5}
\]

\[
\Phi(y, h, X; \theta) = \sum_{t} \sum_{k} \theta_k f_k^{(1)}(y, h_t, X) + \sum_{t} \sum_{k} \theta_k f_k^{(2)}(y, h_{t-1}, h_t, X) \tag{9.6}
\]
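To make Eqs. (9.4)–(9.6) concrete, the following is a minimal brute-force sketch in Python that enumerates every hidden-state sequence h. The feature functions, array shapes, and the names theta_state and theta_trans (realising the weighted features θ_k f_k^(1) and θ_k f_k^(2)) are illustrative assumptions rather than the chapter's actual implementation; a practical system would marginalise over h with dynamic programming (belief propagation) instead of enumeration.

```python
import itertools
import numpy as np

def phi(y, h, X, theta_state, theta_trans):
    # Eq. (9.6): state features f^(1) plus transition features f^(2),
    # each weighted by a component of theta.
    score = 0.0
    for t, h_t in enumerate(h):
        # f^(1)(y, h_t, X): here assumed to be the compatibility of
        # hidden state h_t with the observation x_t (a dot product).
        score += theta_state[y, h_t] @ X[t]
        if t > 0:
            # f^(2)(y, h_{t-1}, h_t, X): a transition weight under label y.
            score += theta_trans[y, h[t - 1], h_t]
    return score

def posterior(X, n_labels, n_hidden, theta_state, theta_trans):
    # Eq. (9.4): P(y | X; theta) marginalises Phi over every hidden
    # sequence h; Eq. (9.5): Z(X; theta) is the total over y and h.
    scores = np.zeros(n_labels)
    for y in range(n_labels):
        for h in itertools.product(range(n_hidden), repeat=len(X)):
            scores[y] += np.exp(phi(y, h, X, theta_state, theta_trans))
    return scores / scores.sum()  # division by Z(X; theta)

# Toy usage: M = 4 observations of dimension 5 (matching the five-entry
# x_m(t) described below), 2 event labels, 3 hidden states.
rng = np.random.default_rng(0)
X = rng.random((4, 5))
theta_state = rng.normal(size=(2, 3, 5))  # weights realising f^(1)
theta_trans = rng.normal(size=(2, 3, 3))  # weights realising f^(2)
print(posterior(X, 2, 3, theta_state, theta_trans))
```

The output is a probability vector over event classes, which is what lets a multi-class HCRF report one final result rather than a per-state label.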
In the event detection application, each x_m from X is a vector descriptor called a local observation. In this notation, the x_m value at a time t is defined as x_m(t) = [p_ws1(t), p_ws2(t), p_ws3(t), p_ws4(t), p_wc(t)], with each entry of x_m(t) calculated as the average result of a sliding window centered at time t, as Fig. 9.5 shows.
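As a sketch of this windowed averaging, the snippet below forms x_m(t) from per-frame five-entry descriptors; the window half-width and the helper name local_observation are assumptions for illustration, not values given in the chapter.

```python
import numpy as np

def local_observation(frame_probs, t, half_width=2):
    # Average the per-frame 5-dimensional vectors over a window centred
    # at time t, clipped at the sequence boundaries.
    lo = max(0, t - half_width)
    hi = min(len(frame_probs), t + half_width + 1)
    return frame_probs[lo:hi].mean(axis=0)

# Toy usage: 10 frames, each holding the five entries
# [p_ws1(t), p_ws2(t), p_ws3(t), p_ws4(t), p_wc(t)].
rng = np.random.default_rng(1)
frame_probs = rng.random((10, 5))
x_t = local_observation(frame_probs, t=5)
```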
The first four entries of x_m(t) are the probabilities of the four possible view types, where p_wsj(t) associates with close-up view, mid view, long view, and outer-field view for j = 1, 2, 3, 4, respectively. The fifth value, p_wc(t), is an associated directional