As with [41], SVM SML is a kernel-based method in which the spatial correlation and the first- and second-order temporal correlations are modeled within a joint kernel. Moreover, multimodal features and the temporal dynamics of low-level features can be integrated into SVM SML in the form of basic kernels.
4.3.2.2 SML Formulation for Video Annotation
Let $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_t, \ldots, \mathbf{x}_T) \in \chi$ denote the sequence of input features (i.e., visual/audio/text features) extracted from a video clip consisting of $T$ shots, where $\chi$ is the input feature space.
is the input feature space. The output sequence of multilabels is expressed by
y
y 1
y t
y T
,where y T
=(
,...,
,...,
) κ
ν
.Here
ν
and
κ
are the output spaces
of individual shot and shot sequence, respectively. Let C
)
represent the lexicon of M semantic concepts. Each entry (i.e. the multilabel
of an elementary shot) of the output multilabel sequence can be expressed by
an M
=(
c 1
,...,
c m
,...,
c M
dimensional label vector y t
y t 1 ,...,
y t m ,...,
y t M ) ,where y t m ∈{
=(
1
,
0
}
indicates
whether
concept
c m
is
present
in
the t th
shot.
Accordingly, $L = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_i, \mathbf{y}_i), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\}$ denotes the training set consisting of $N$ sequences.
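To make the notation concrete, the following is a minimal sketch of how such a training set could be represented in code. All names and sizes here are illustrative assumptions, not from the source: a clip's features form a $T \times d$ matrix and its multilabels a binary $T \times M$ matrix.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the source):
# T shots per clip, d-dimensional features, M lexicon concepts.
T, d, M = 5, 128, 10

# One input sequence x = (x_1, ..., x_T): a T x d feature matrix.
x = np.random.rand(T, d)

# One output multilabel sequence y = (y_1, ..., y_T): a T x M binary
# matrix, where y[t, m] = 1 iff concept c_m is present in shot t.
y = (np.random.rand(T, M) < 0.2).astype(int)

# Training set L = {(x_1, y_1), ..., (x_N, y_N)}: a list of N pairs.
L = [(x, y)]
```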
Given the training set $L$, SML aims to learn an optimal mapping from a sequence of input features to a sequence of output multilabels. For an unknown shot sequence $\mathbf{x}$, the sequence of output multilabels can be predicted as:
$$\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_t, \ldots, \mathbf{y}_T) = \arg\max_{(\mathbf{y}_1, \ldots, \mathbf{y}_T) \in \kappa} F(\mathbf{x}_1, \ldots, \mathbf{x}_t, \ldots, \mathbf{x}_T, \mathbf{y}_1, \ldots, \mathbf{y}_t, \ldots, \mathbf{y}_T), \qquad (4.1)$$
where $F(\cdot)$ is the SML score function over the input feature sequence and the output multilabel sequence. SML predicts the annotation of the shot sequence by maximizing the score function $F(\cdot)$ over all candidate multilabel sequences. As shown in Fig. 4.4c, different types of spatial and temporal contexts in the shot sequence can also be incorporated into the prediction.
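For illustration only, the sketch below decodes Eq. (4.1) by brute force: it enumerates every candidate multilabel sequence in $\kappa$ and keeps the one that maximizes a user-supplied score function F. This F is a hypothetical stand-in (the chapter's actual $F(\cdot)$ is learned, e.g., within a joint kernel), and the enumeration is exponential in $T$ and $M$, so this is a conceptual sketch rather than a practical decoder.

```python
import itertools
import numpy as np

def sml_decode(x, M, F):
    """Brute-force Eq. (4.1): argmax of F over all multilabel sequences.

    x : (T, d) feature sequence; M : lexicon size;
    F : assumed score function F(x, y) over the whole sequence.
    There are 2^(T*M) candidates, so this is conceptual only.
    """
    T = x.shape[0]
    shot_labels = list(itertools.product([0, 1], repeat=M))  # space nu
    best_y, best_score = None, -np.inf
    for seq in itertools.product(shot_labels, repeat=T):     # space kappa
        y = np.array(seq)                                    # T x M labels
        score = F(x, y)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```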
SML is a generalized formulation for video annotation. That is, IML and IML-T can be viewed as two special cases of SML. When all video shots are assumed to be independent of each other, SML reduces to IML:

$$\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_t, \ldots, \mathbf{y}_T), \quad \text{where } \mathbf{y}_t = \arg\max_{\mathbf{y}_t \in \nu} F(\mathbf{x}_t, \mathbf{y}_t). \qquad (4.2)$$
In IML, the detection of one concept depends only on the low-level features and the other concepts within the current shot.
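With the independence assumption, the decoding in Eq. (4.2) factorizes into one argmax per shot. The sketch below shows this, assuming a per-shot score function F_shot, a hypothetical stand-in for the learned $F(\cdot)$ restricted to a single shot:

```python
import itertools
import numpy as np

def iml_decode(x, M, F_shot):
    """Eq. (4.2): independent per-shot argmax, no inter-shot context.

    F_shot(x_t, y_t) scores one shot's multilabel (assumed signature)."""
    shot_labels = [np.array(v) for v in itertools.product([0, 1], repeat=M)]
    # One argmax over the per-shot label space nu for each shot t.
    return np.array([max(shot_labels, key=lambda yt: F_shot(xt, yt))
                     for xt in x])
```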
IML-T is a two-step optimization process which improves the initial annotation results of IML by:

$$\mathbf{y}_t = \arg\max_{\mathbf{y}_t \in \nu} \phi\left(\mathbf{y}_{(t-w)}, \ldots, \mathbf{y}_{(t-1)}, \mathbf{y}_t, \mathbf{y}_{(t+1)}, \ldots, \mathbf{y}_{(t+w)}\right), \quad \text{where } \mathbf{y}_t = \arg\max_{\mathbf{y}_t \in \nu} F(\mathbf{x}_t, \mathbf{y}_t). \qquad (4.3)$$
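Following the same conventions, here is a minimal sketch of the two-step process in Eq. (4.3): step one computes initial labels per shot as in IML, and step two re-scores each shot's label with a refinement function over a window of $\pm w$ neighboring initial labels. The Python signature phi(left, y_t, right) is an assumed rendering of $\phi(\cdot)$, and F_shot is the same hypothetical per-shot scorer as above.

```python
import itertools
import numpy as np

def iml_t_decode(x, M, F_shot, phi, w):
    """Eq. (4.3): two-step IML-T decoding.

    Step 1: initial per-shot labels via IML (Eq. 4.2).
    Step 2: refine each shot with phi over its +/- w initial neighbors.
    phi(left, y_t, right) is a hypothetical signature; w is the window
    radius."""
    shot_labels = [np.array(v) for v in itertools.product([0, 1], repeat=M)]
    # Step 1: initial annotation, one argmax per shot.
    y0 = [max(shot_labels, key=lambda yt: F_shot(xt, yt)) for xt in x]
    # Step 2: re-decode each shot given its temporal neighborhood.
    refined = []
    for t in range(len(y0)):
        left = y0[max(0, t - w):t]        # y_(t-w), ..., y_(t-1)
        right = y0[t + 1:t + 1 + w]       # y_(t+1), ..., y_(t+w)
        refined.append(max(shot_labels,
                           key=lambda yt: phi(left, yt, right)))
    return np.array(refined)
```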