capture attention requires finding the value, location, and extent of the most salient
subset of the input visual stimuli. Typically, a saliency map gives the probability that
each location is salient and involves both the stimulus-driven (i.e., bottom-up) and the
task-related (i.e., top-down) components of the human visual system. In Sect. 4.3.3, we
will present two rank learning approaches for visual saliency estimation in video.
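As a concrete (if simplistic) illustration of this decomposition, the sketch below fuses a bottom-up map with a top-down map into a per-location saliency probability. The linear fusion and the mixing weight alpha are illustrative assumptions, not the rank learning approaches of Sect. 4.3.3.

```python
import numpy as np

def combine_saliency(bottom_up, top_down, alpha=0.5):
    """Fuse a stimulus-driven (bottom-up) map with a task-related
    (top-down) map into one per-location saliency probability map.
    `alpha` is a hypothetical mixing weight, not a value from the text."""
    fused = alpha * bottom_up + (1.0 - alpha) * top_down
    fused = fused - fused.min()          # shift to be non-negative
    total = fused.sum()
    return fused / total if total > 0 else fused  # normalize to probabilities

# Toy usage: two 4x4 random maps standing in for real feature-derived maps.
rng = np.random.default_rng(0)
prob_map = combine_saliency(rng.random((4, 4)), rng.random((4, 4)))
print(prob_map.sum())  # ~1.0: each cell is the probability that the location is salient
```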
4.3.2 Video Annotation with Sequence Multilabeling
In recent years, various supervised learning methods (e.g., support vector machines
[8, 65], graphical models [41], and multi-modality fusion methods [27, 51]) have been
employed to find informative feature patterns for detecting concepts in video data.
However, due to the well-known semantic gap, video annotation methods relying
purely on low-level features cannot achieve the desired performance.
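As a minimal sketch of this purely feature-driven setting, the following trains one independent binary SVM per concept on low-level shot features. The feature dimensionality, concept names, and random data are hypothetical, and the per-concept independence is precisely what ignores the contextual cues discussed next.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: 100 shots, 64-dim low-level features (e.g., a color histogram),
# with hypothetical binary labels for two concepts.
rng = np.random.default_rng(0)
features = rng.random((100, 64))
labels = {"car": rng.integers(0, 2, 100), "street": rng.integers(0, 2, 100)}

# One independent binary SVM per concept; spatial and temporal context are
# ignored, which is the limitation the sequence-multilabeling view addresses.
detectors = {c: SVC(probability=True).fit(features, y) for c, y in labels.items()}

new_shot = rng.random((1, 64))
for concept, clf in detectors.items():
    print(concept, clf.predict_proba(new_shot)[0, 1])  # per-concept confidence
```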
Video data are by nature rich in spatial and temporal context that can facilitate
annotation. Generally speaking, semantic concepts may have spatial correlations
within a shot and temporal consistencies between consecutive shots. That is, several
concepts may co-occur within a shot due to spatial correlation, and a concept may
persist across several neighboring shots due to temporal consistency. For example,
in Fig. 4.3, street and building co-occur in shot t and shot t + 1, while car is present
in three consecutive shots. Moreover, two distinct concepts may correlate with each
other between shots.
Fig. 4.3 Illustration of video annotation with multilabels. For a shot sequence, concepts present
in neighboring shots exhibit several contextual relationships. Note that here both the temporal
consistency of a concept and the temporal dependency between concepts across neighboring shots
are referred to as temporal correlation
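To make the two kinds of context concrete, the hedged sketch below counts within-shot co-occurrence (spatial correlation) and across-shot persistence (temporal correlation) from a toy label sequence patterned on Fig. 4.3; the shot labels and the resulting counts are illustrative assumptions, not data from the chapter.

```python
from itertools import combinations
from collections import Counter

# Hypothetical ground-truth labels for four consecutive shots, mirroring Fig. 4.3.
shots = [
    {"car", "street", "building"},   # shot t
    {"car", "street", "building"},   # shot t+1
    {"car"},                         # shot t+2
    {"person"},                      # shot t+3
]

# Spatial correlation: how often two concepts co-occur within the same shot.
co_occur = Counter(pair for s in shots for pair in combinations(sorted(s), 2))

# Temporal correlation: how often a concept in one shot is followed by a
# concept (the same or a different one) in the next shot.
temporal = Counter(
    (a, b) for prev, cur in zip(shots, shots[1:]) for a in prev for b in cur
)

print(co_occur[("building", "street")])  # street and building co-occur in 2 shots
print(temporal[("car", "car")])          # car persists across 2 shot transitions
```

Counts like these are the raw statistics a sequence-multilabeling model can exploit, rather than annotating each shot in isolation.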