As with [41], SVM SML is a kernel-based method in which the spatial correlation and the first- and second-order temporal correlations are modeled within a joint kernel. Moreover, multimodal features and the temporal dynamics of low-level features can be integrated into SVM SML in the form of basic kernels.
4.3.2.2 SML Formulation for Video Annotation
Let $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_t, \ldots, \mathbf{x}_T) \in \chi$ denote the sequence of input features (i.e., visual/audio/text features) extracted from a video clip consisting of $T$ shots, where $\chi$ is the input feature space.
is the input feature space. The output sequence of multilabels is expressed by
y
y 1
y t
y T
,where y T
=(
,...,
,...,
) κ
ν
.Here
ν
and
κ
are the output spaces
of individual shot and shot sequence, respectively. Let C
)
represent the lexicon of M semantic concepts. Each entry (i.e. the multilabel
of an elementary shot) of the output multilabel sequence can be expressed by
an M
=(
c 1
,...,
c m
,...,
c M
dimensional label vector y t
y t 1 ,...,
y t m ,...,
y t M ) ,where y t m ∈{
=(
1
,
0
}
indicates
whether
concept
c m
is
present
in
the t th
shot.
Accordingly, $L = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_i, \mathbf{y}_i), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\}$ denotes the training set consisting of $N$ sequences.
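To make the notation concrete, the following is a minimal sketch of how such a training set could be represented in code. All names and sizes here are illustrative assumptions, not from the source: a clip's features form a $T \times d$ matrix and its multilabels a binary $T \times M$ matrix.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the source):
# T shots per clip, d-dimensional features, M lexicon concepts.
T, d, M = 5, 128, 10

# One input sequence x = (x_1, ..., x_T): a T x d feature matrix.
x = np.random.rand(T, d)

# One output multilabel sequence y = (y_1, ..., y_T): a T x M binary
# matrix, where y[t, m] = 1 iff concept c_m is present in shot t.
y = (np.random.rand(T, M) < 0.2).astype(int)

# Training set L = {(x_1, y_1), ..., (x_N, y_N)}: a list of N pairs.
L = [(x, y)]
```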
Given the training set $L$, SML aims to learn an optimal mapping from a sequence of input features to a sequence of output multilabels. For an unknown shot sequence $\mathbf{x}$, the sequence of output multilabels can be predicted as:
$$\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_t, \ldots, \mathbf{y}_T) = \arg\max_{(\mathbf{y}_1, \ldots, \mathbf{y}_T) \in \kappa} F(\mathbf{x}_1, \ldots, \mathbf{x}_t, \ldots, \mathbf{x}_T, \mathbf{y}_1, \ldots, \mathbf{y}_t, \ldots, \mathbf{y}_T), \qquad (4.1)$$
where $F(\cdot)$ is the SML score function over the input feature sequence and the output multilabel sequence. SML predicts the annotation of the shot sequence by maximizing the score function $F(\cdot)$ over all candidate multilabel sequences. As shown in Fig. 4.4c, different types of spatial and temporal contexts in the shot sequence can also be incorporated into the prediction.
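For illustration only, the sketch below decodes Eq. (4.1) by brute force: it enumerates every candidate multilabel sequence in $\kappa$ and keeps the one that maximizes a user-supplied score function F. This F is a hypothetical stand-in (the chapter's actual $F(\cdot)$ is learned, e.g., within a joint kernel), and the enumeration is exponential in $T$ and $M$, so this is a conceptual sketch rather than a practical decoder.

```python
import itertools
import numpy as np

def sml_decode(x, M, F):
    """Brute-force Eq. (4.1): argmax of F over all multilabel sequences.

    x : (T, d) feature sequence; M : lexicon size;
    F : assumed score function F(x, y) over the whole sequence.
    There are 2^(T*M) candidates, so this is conceptual only.
    """
    T = x.shape[0]
    shot_labels = list(itertools.product([0, 1], repeat=M))  # space nu
    best_y, best_score = None, -np.inf
    for seq in itertools.product(shot_labels, repeat=T):     # space kappa
        y = np.array(seq)                                    # T x M labels
        score = F(x, y)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```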
SML is a generalized formulation for video annotation. That is, IML and IML-T can be viewed as two special cases of SML. When all video shots are assumed to be independent of each other, SML reduces to IML:

$$\mathbf{y} = (\mathbf{y}_1, \ldots, \mathbf{y}_t, \ldots, \mathbf{y}_T), \quad \text{where } \mathbf{y}_t = \arg\max_{\mathbf{y}_t \in \nu} F(\mathbf{x}_t, \mathbf{y}_t). \qquad (4.2)$$
In IML, the detection of one concept depends only on the low-level features and the other concepts within the current shot.
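With the independence assumption, the decoding in Eq. (4.2) factorizes into one argmax per shot. The sketch below shows this, assuming a per-shot score function F_shot, a hypothetical stand-in for the learned $F(\cdot)$ restricted to a single shot:

```python
import itertools
import numpy as np

def iml_decode(x, M, F_shot):
    """Eq. (4.2): independent per-shot argmax, no inter-shot context.

    F_shot(x_t, y_t) scores one shot's multilabel (assumed signature)."""
    shot_labels = [np.array(v) for v in itertools.product([0, 1], repeat=M)]
    # One argmax over the per-shot label space nu for each shot t.
    return np.array([max(shot_labels, key=lambda yt: F_shot(xt, yt))
                     for xt in x])
```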
IML-T is a two-step optimization process which improves the initial annotation results of IML by:

$$\mathbf{y}_t = \arg\max_{\mathbf{y}_t \in \nu} \phi\left(\mathbf{y}_{(t-w)}, \ldots, \mathbf{y}_{(t-1)}, \mathbf{y}_t, \mathbf{y}_{(t+1)}, \ldots, \mathbf{y}_{(t+w)}\right), \quad \text{where } \mathbf{y}_t = \arg\max_{\mathbf{y}_t \in \nu} F(\mathbf{x}_t, \mathbf{y}_t). \qquad (4.3)$$
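Following the same conventions, here is a minimal sketch of the two-step process in Eq. (4.3): step one computes initial labels per shot as in IML, and step two re-scores each shot's label with a refinement function over a window of $\pm w$ neighboring initial labels. The Python signature phi(left, y_t, right) is an assumed rendering of $\phi(\cdot)$, and F_shot is the same hypothetical per-shot scorer as above.

```python
import itertools
import numpy as np

def iml_t_decode(x, M, F_shot, phi, w):
    """Eq. (4.3): two-step IML-T decoding.

    Step 1: initial per-shot labels via IML (Eq. 4.2).
    Step 2: refine each shot with phi over its +/- w initial neighbors.
    phi(left, y_t, right) is a hypothetical signature; w is the window
    radius."""
    shot_labels = [np.array(v) for v in itertools.product([0, 1], repeat=M)]
    # Step 1: initial annotation, one argmax per shot.
    y0 = [max(shot_labels, key=lambda yt: F_shot(xt, yt)) for xt in x]
    # Step 2: re-decode each shot given its temporal neighborhood.
    refined = []
    for t in range(len(y0)):
        left = y0[max(0, t - w):t]        # y_(t-w), ..., y_(t-1)
        right = y0[t + 1:t + 1 + w]       # y_(t+1), ..., y_(t+w)
        refined.append(max(shot_labels,
                           key=lambda yt: phi(left, yt, right)))
    return np.array(refined)
```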