of features. Recently, Qi et al. [48] employ a multilabel classifier to build a correlative multilabeling (CML) framework. CML exploits the concept correlations within individual shots and annotates shots with multiple concepts simultaneously. Overall, these works detect concepts within individual shots independently. In a sense, they can be considered a direct extension of image annotation methods to the video domain, with little temporal context taken into account.
As video is temporally informative, researchers attempt to utilize temporal information to enhance video annotation. Generally speaking, video annotation methods over sequential shots can be categorized into three types. The first type models the temporal patterns of low-level features. For example, a hidden Markov model (HMM) is employed by Xie et al. [59] to model the temporal dynamics of low-level features (e.g., color and motion) for specific video event detection. Qi et al. [48] introduce a temporal kernel into CML to model the similarity between sequences of low-level features. In these works, the temporal dynamics of low-level features are exploited to improve specific concept detectors, whereas higher-level temporal correlations of concepts are ignored.
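To make the first type concrete, the sketch below illustrates the general idea behind HMM-based modeling of low-level temporal dynamics, as in [59]: one HMM is trained per event on sequences of shot-level features, and a new shot sequence is assigned to the event whose model yields the highest likelihood. The feature dimensionality, number of hidden states, and use of the hmmlearn library are illustrative assumptions, not details taken from the cited work.

```python
# Sketch: HMM-based temporal modeling of low-level features (cf. Xie et al. [59]).
# Assumptions: shot-level features are pre-extracted (e.g., color/motion descriptors);
# one GaussianHMM is fitted per event; hmmlearn is used purely for illustration.
import numpy as np
from hmmlearn import hmm

def train_event_hmm(sequences, n_states=4):
    """Fit one HMM on a list of shot-feature sequences belonging to a single event."""
    X = np.vstack(sequences)                      # (total_shots, feat_dim)
    lengths = [len(s) for s in sequences]         # number of shots in each sequence
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def detect_event(models, shot_features):
    """Label a shot sequence with the event whose HMM gives the highest log-likelihood."""
    scores = {event: m.score(shot_features) for event, m in models.items()}
    return max(scores, key=scores.get)

# Usage with synthetic data: two events, each with a few training sequences of 10 shots.
rng = np.random.default_rng(0)
train = {
    "play":  [rng.normal(0.0, 1.0, size=(10, 16)) for _ in range(5)],
    "break": [rng.normal(2.0, 1.0, size=(10, 16)) for _ in range(5)],
}
models = {event: train_event_hmm(seqs) for event, seqs in train.items()}
print(detect_event(models, rng.normal(2.0, 1.0, size=(10, 16))))  # likely "break"
```

Note that such models capture only the dynamics of low-level features; the labels themselves carry no cross-concept temporal structure.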
The second type performs temporal refinement over IML (IML-T), in which concepts are first annotated over individual shots and then refined with temporal consistency (as shown in Fig. 4.4b). For example, Yang et al. [64] and Liu et al. [35] incorporate temporal consistency into active learning to detect multiple video concepts. Weng et al. [58] and Liu et al. [36] propose several fusion methods to refine the annotation results of individual shots, where spatial correlations and temporal consistencies of concepts are modeled by association rules and temporal filtering, respectively. Higher-order temporal consistency of concepts is also explored in [58]. In these methods, the outputs of each concept detector across consecutive shots are smoothed to preserve temporal consistency; a minimal sketch of such smoothing follows this paragraph. However, the pairwise interactions of distinct concepts across consecutive shots are largely ignored. Despite some improvements, these methods suffer from unstable performance.
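The sketch below illustrates the refinement step of the IML-T idea in its simplest form: independent per-shot detector scores are smoothed over a temporal window, concept by concept. The sliding-window average is a generic stand-in, not the specific fusion rules of Weng et al. [58] or Liu et al. [36], and the array shapes are assumptions for illustration.

```python
# Sketch: IML-T-style temporal refinement of per-shot concept scores.
# Assumption: `scores` is a (num_shots, num_concepts) array of independent detector
# outputs; each concept is smoothed over a window of consecutive shots.
import numpy as np

def temporal_smooth(scores, window=3):
    """Average each concept's score over a window of consecutive shots."""
    num_shots, _ = scores.shape
    half = window // 2
    refined = np.empty_like(scores, dtype=float)
    for t in range(num_shots):
        lo, hi = max(0, t - half), min(num_shots, t + half + 1)
        refined[t] = scores[lo:hi].mean(axis=0)   # enforce temporal consistency per concept
    return refined

# Usage: a spurious dip for concept 0 at shot 2 is damped by its temporal neighbors.
raw = np.array([[0.9, 0.1],
                [0.8, 0.2],
                [0.1, 0.2],   # likely a detector error on an isolated shot
                [0.9, 0.1],
                [0.8, 0.3]])
print(temporal_smooth(raw).round(2))
```

Because each concept is filtered separately, this kind of refinement cannot model interactions between distinct concepts across shots, which motivates the third type below.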
The third type models the spatial and temporal contexts of concepts jointly over time. Besides the dynamics of low-level features, the spatial and temporal contexts of higher-level concepts can assist event/action detection [9, 57]. However, there are few generic approaches to enhancing video annotation with spatial and temporal correlations of concepts. Naphade et al. [41] integrate the spatial co-occurrence and temporal dependency of concepts into a probabilistic Bayesian network, so that the pairwise relationships of concepts within one frame (or shot) and between two adjacent frames (or shots) can be modeled. Alternatively, in this section, video annotation is formulated as a form of sequence multilabeling (SML) and solved with a unified learning framework that captures both spatial and temporal correlations of semantic concepts (as shown in Fig. 4.4c). Compared with IML-T methods, SVM SML learns both the SML score function and the contributions of multiple cues (i.e., distinct low-level features, and spatial and temporal correlations of concept labels) in a single stage over the same training dataset. SVM SML does not require any initial annotation and greatly alleviates the problem of error propagation. Moreover, learning the SML score function together with the spatial and temporal context over the same training data avoids additional effort on data collection and labeling. Compared