with no prior knowledge. For instance, if a key frame immediately preceding the current state is missed due to uniform sampling, such information loss can be compensated for by including and summing up distant informative frames (both previous and future) from the uniform sampling, without misclassifying the event.
Second, HCRF has merit in its hidden-state structure, which helps to relax the requirement of explicitly observed states. This relaxation property is also an advantage in dealing with large-scale, uniformly sampled video frames. Without such a configuration, a CRF model outputs an individual result label (such as event or non-event) per state and requires a separate CRF to represent each possible event [279]. In HCRF, only one final result is presented, in terms of the occurrence probabilities of multiple event classes. From the point of view of robustness, a CRF model can easily be ruined by semantically unrelated frames introduced by automatic uniform sampling. A multi-class HCRF, on the other hand, can correct the error introduced by such unrelated frames using its probability-based outputs [293].
Moreover, HCRF is also appealing because it allows the use of training data that is not explicitly labeled and has only partial structure [293]. In the literature, HCRF has been successfully applied to gesture recognition [293, 294] and phone classification [295].
Figure 9.4a illustrates an HCRF structure in which a label y ∈ Y of event type is predicted from an input X. This input consists of a sequence of vectors X = (x_1, x_2, ..., x_m, ..., x_M), with each x_m representing a local state observation along the HCRF structure. In order to predict y from a given input X, the conditional probabilistic model defined in [293] and in Eq. (9.4) is adopted. In the equation, the model parameter θ is used to describe the local potential function Φ, which is expanded in Eq. (9.6). A sequence of latent variables h = (h_1, h_2, ..., h_m, ..., h_M) is also introduced in Eq. (9.4); these variables are not observable from the structure of Fig. 9.4a. Each member h_m of h corresponds to a state s_m. The denominator Z(X; θ) is the normalization factor, which is expanded in Eq. (9.5).
\[
P(y \mid X; \theta) = \sum_{h} P(y, h \mid X; \theta) = \frac{\sum_{h} e^{\Phi(y, h, X; \theta)}}{Z(X; \theta)} \tag{9.4}
\]

\[
Z(X; \theta) = \sum_{y, h} e^{\Phi(y, h, X; \theta)} \tag{9.5}
\]

\[
\Phi(y, h, X; \theta) = \sum_{t} \sum_{k} \theta_k f_k^{(1)}(y, h_t, X) + \sum_{t} \sum_{k} \theta_k f_k^{(2)}(y, h_{t-1}, h_t, X) \tag{9.6}
\]
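To make Eqs. (9.4)–(9.6) concrete, the following is a minimal brute-force sketch in Python that enumerates every hidden-state sequence h. The feature functions, array shapes, and the names theta_state and theta_trans (realising the weighted features θ_k f_k^(1) and θ_k f_k^(2)) are illustrative assumptions rather than the chapter's actual implementation; a practical system would marginalise over h with dynamic programming (belief propagation) instead of enumeration.

```python
import itertools
import numpy as np

def phi(y, h, X, theta_state, theta_trans):
    # Eq. (9.6): state features f^(1) plus transition features f^(2),
    # each weighted by a component of theta.
    score = 0.0
    for t, h_t in enumerate(h):
        # f^(1)(y, h_t, X): here assumed to be the compatibility of
        # hidden state h_t with the observation x_t (a dot product).
        score += theta_state[y, h_t] @ X[t]
        if t > 0:
            # f^(2)(y, h_{t-1}, h_t, X): a transition weight under label y.
            score += theta_trans[y, h[t - 1], h_t]
    return score

def posterior(X, n_labels, n_hidden, theta_state, theta_trans):
    # Eq. (9.4): P(y | X; theta) marginalises Phi over every hidden
    # sequence h; Eq. (9.5): Z(X; theta) is the total over y and h.
    scores = np.zeros(n_labels)
    for y in range(n_labels):
        for h in itertools.product(range(n_hidden), repeat=len(X)):
            scores[y] += np.exp(phi(y, h, X, theta_state, theta_trans))
    return scores / scores.sum()  # division by Z(X; theta)

# Toy usage: M = 4 observations of dimension 5 (matching the five-entry
# x_m(t) described below), 2 event labels, 3 hidden states.
rng = np.random.default_rng(0)
X = rng.random((4, 5))
theta_state = rng.normal(size=(2, 3, 5))  # weights realising f^(1)
theta_trans = rng.normal(size=(2, 3, 3))  # weights realising f^(2)
print(posterior(X, 2, 3, theta_state, theta_trans))
```

The output is a probability vector over event classes, which is what lets a multi-class HCRF report one final result rather than a per-state label.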
In the event detection application, each x_m from X is a vector descriptor called a local observation. In this notation, the x_m value at a time t is defined as x_m(t) = [p_ws1(t), p_ws2(t), p_ws3(t), p_ws4(t), p_wc(t)], with each entry of x_m(t) calculated as the average result of a sliding window centered at time t, as Fig. 9.5 shows.
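As a sketch of this windowed averaging, the snippet below forms x_m(t) from per-frame five-entry descriptors; the window half-width and the helper name local_observation are assumptions for illustration, not values given in the chapter.

```python
import numpy as np

def local_observation(frame_probs, t, half_width=2):
    # Average the per-frame 5-dimensional vectors over a window centred
    # at time t, clipped at the sequence boundaries.
    lo = max(0, t - half_width)
    hi = min(len(frame_probs), t + half_width + 1)
    return frame_probs[lo:hi].mean(axis=0)

# Toy usage: 10 frames, each holding the five entries
# [p_ws1(t), p_ws2(t), p_ws3(t), p_ws4(t), p_wc(t)].
rng = np.random.default_rng(1)
frame_probs = rng.random((10, 5))
x_t = local_observation(frame_probs, t=5)
```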
The first four entries of x_m(t) are the probabilities of the four possible view types, where p_wsj(t) associates with close-up view, mid view, long view, and outer-field view for j = 1, 2, 3, 4, respectively. The fifth value, p_wc(t), is an associated directional