Video Scene Analysis: A Machine Learning Perspective - Video Segmentation and Its Applications

stage, the precise video content associated with the semantic event has been detected in terms of event boundary detection and accuracy analysis. For example, webcasting text was studied for coarse-stage event detection and video alignment, with analysis of events such as replay scenes and various goal and shot scenes in soccer video [7, 61].

Since the proposed framework targets a generic learning model that can be extended to large-scale data, we propose a structured prediction model based on the hidden conditional random field (HCRF) that uses the previously classified views and completes the generic approach. For example, the HCRF model can be used to detect the score event in basketball for exciting events and highlights. Such an HCRF technique belongs to the state event model defined in the related works. Therefore, the HCRF takes the labeled sequences as input in a natural and seamless fashion. On the other hand, the HCRF is a comprehensive model, which reduces to a hidden Markov model (HMM) or a conditional random field (CRF) under certain constraints. Compared with the other two models, the merits of the HCRF are its resilience and robustness, which come from combining hidden states with a relaxation of the Markov property.
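As a reading aid, the HCRF conditional probability in its commonly used form can be written as below. This is a sketch under assumptions: the symbols h (a sequence of hidden states), θ (model parameters), and Ψ (a potential function) do not appear in this excerpt and are introduced here only for illustration; y, Y, and X denote the event label, the label set, and the input sequence.

```latex
P(y \mid X; \theta)
  = \sum_{h} P(y, h \mid X; \theta)
  = \frac{\sum_{h} \exp \Psi(y, h, X; \theta)}
         {\sum_{y' \in Y} \sum_{h} \exp \Psi(y', h, X; \theta)}
```

Under this form, the label is scored by summing over all hidden-state sequences instead of committing to a single one. One possible reading of the "certain constraints" mentioned above is that, if every hidden state were observed, the sum over h would collapse to a single term and leave a CRF-style conditional model; this interpretation is an assumption and is not stated in the text.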

Using the HCRF on large-scale datasets has several advantages over HMM or CRF models. First, the HCRF relaxes the Markov property, which assumes that the future state depends only on the current state. In our generic framework, video frames are uniformly decimated and sampled, regardless of the temporal pace of the video itself. In some cases, several consecutive frames have the same label, while in other cases different labels are assigned. A Markov-property-based model such as the HMM is appropriate for the former scenario but not for the latter, since the future state in an HMM depends only on the current state label and not on previous states. The HCRF, on the other hand, is flexible and takes into account surrounding states both before and after the current state. Thus, the HCRF is more robust when dealing with a large-scale homogeneous process and uniform sampling with no prior knowledge. For instance, if a key frame immediately preceding the current stage is missed due to the uniform sampling, the information loss can be compensated by including and summing up earlier or later information, without misclassifying the event. Second, the HCRF has merit in its hidden-state structure, which relaxes the requirement of explicitly observed states. This is also an advantage in dealing with large-scale uniformly sampled video frames, because in computation the CRF model outputs an individual result label (such as event or not event) per state and requires separate CRFs to represent each possible event [62]. In the HCRF, only one final result is presented, in terms of multi-class event occurrence probabilities. From the robustness point of view, a CRF model can easily be ruined by semantically unrelated frames introduced by the automatic uniform sampling. A multiclass HCRF, on the other hand, can correct the error introduced by such unrelated frames using probability-based outputs [49]. Moreover, the HCRF is also appealing in allowing the use of training data with partial structure that is not explicitly labeled [49]. In the literature, the HCRF has been successfully used in gesture recognition [49] and phone classification [12].
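To make the "one final multi-class result per sequence" idea concrete, the following Python sketch scores a short sequence of frame feature vectors with a chain-structured HCRF and returns a single probability per event class. It is a minimal illustration under assumptions, not the chapter's implementation: the parameter names (`theta_obs`, `theta_lab`, `theta_edge`), the way the potential is split into observation, label, and transition terms, and all toy values are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def class_log_score(x, y, theta_obs, theta_lab, theta_edge):
    """Return log sum_h exp(Psi(y, h, x)) for a chain of hidden states.

    x          : non-empty list of per-frame feature vectors
    theta_obs  : theta_obs[h] is a weight vector tying hidden state h to features
    theta_lab  : theta_lab[y][h] scores hidden state h under class y
    theta_edge : theta_edge[y][g][h] scores the transition g -> h under class y
    The sum over all hidden sequences is computed with a forward recursion.
    """
    n_hidden = len(theta_obs)

    def node(t, h):
        return sum(w * f for w, f in zip(theta_obs[h], x[t])) + theta_lab[y][h]

    alpha = [node(0, h) for h in range(n_hidden)]
    for t in range(1, len(x)):
        alpha = [
            node(t, h)
            + logsumexp([alpha[g] + theta_edge[y][g][h] for g in range(n_hidden)])
            for h in range(n_hidden)
        ]
    return logsumexp(alpha)


def hcrf_posterior(x, theta_obs, theta_lab, theta_edge):
    """Return one probability per event class for the whole sequence."""
    scores = [
        class_log_score(x, y, theta_obs, theta_lab, theta_edge)
        for y in range(len(theta_lab))
    ]
    log_z = logsumexp(scores)
    return [math.exp(s - log_z) for s in scores]


# Toy setup (hypothetical values): 2 hidden states, 2 event classes, 2-D features.
theta_obs = [[1.0, -1.0], [-1.0, 1.0]]
theta_lab = [[0.5, -0.5], [-0.5, 0.5]]
theta_edge = [[[0.2, 0.0], [0.0, 0.2]], [[0.2, 0.0], [0.0, 0.2]]]

# Four sampled frames; the third one looks unlike its neighbours.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(hcrf_posterior(frames, theta_obs, theta_lab, theta_edge))
```

Because the class score aggregates evidence over all frames and all hidden-state sequences, a single atypical frame shifts the class probabilities but does not produce its own hard label, which is the behaviour the paragraph above contrasts with a per-state CRF output.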

Figure 4.12a illustrates an HCRF structure, in which a label y ∈ Y of event type is predicted from an input X. This input consists of a sequence of vectors
