graph and the less-visited nodes (pixels) were selected as "salient". Using spectrum analysis, Hou and Zhang [16] computed image saliency by extracting spectral residuals from the amplitude spectrum of the Fourier Transform, and Guo et al. [13] modeled video saliency as a sort of inconsistency in the phase spectrum of the Quaternion Fourier Transform. Alternatively, Marat et al. [38] presented a biology-inspired spatiotemporal saliency model, which extracted two signals from the video stream corresponding to the two main outputs of the retina. Both signals were then transformed and fused into a spatiotemporal saliency map.
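The spectral-residual idea can be summarized in a few lines. The sketch below is a minimal illustration in the spirit of Hou and Zhang [16]; the function name, the size of the averaging window, and the final Gaussian smoothing are assumptions made for the example rather than the authors' reference settings.

```python
# Minimal spectral-residual saliency sketch (assumed parameters, not the reference code).
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual_saliency(gray_image, avg_window=3, blur_sigma=2.5):
    """Return a saliency map for a 2-D grayscale image given as a float array."""
    # Fourier transform of the image
    spectrum = np.fft.fft2(gray_image)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)

    # Spectral residual: log amplitude minus its local average
    residual = log_amplitude - uniform_filter(log_amplitude, size=avg_window)

    # Back to the spatial domain; squared magnitude gives the saliency response
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2

    # Smooth and normalise to [0, 1] for display
    saliency = gaussian_filter(saliency, sigma=blur_sigma)
    return (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
```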
From the phenomenological perspective, the bottom-up approaches estimate visual saliency mainly from the visual stimuli. However, the task, which involves an act of "will" toward probable salient targets, also plays an important role. Biological evidence shows that the neurons linked with various stimuli undergo a mutual competition to generate the bottom-up saliency, while the task can bias such competition in favor of a specific category of stimuli [10]. For example, Peters and Itti [46] showed that when performing different tasks in video games, an individual's attention could be predicted by the respective task-relevant models. In these processes, the adopted tasks worked as different top-down controls to modulate the bottom-up process. In real-world scenes, however, it is difficult to explicitly predefine such tasks. Instead, some approaches such as [18] and [32] treated the top-down control as priors used to segment images before the bottom-up saliency estimation. In their works, an image was first partitioned into regions, and the regional saliency was then estimated by regional differences. Other works, such as [6] and [37], introduced top-down factors into the classical bottom-up framework by extracting semantic clues (e.g., face, speech and music, camera motion). These approaches could provide impressive results but relied on the performance of image segmentation and semantic clue extraction.
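To make the regional-difference idea concrete, the sketch below scores each region by how much its mean color differs from that of the other regions. The coarse grid partition (standing in for a real segmentation), the Euclidean color distance, and the spatial weighting are simplifying assumptions for illustration, not the formulations used in [18] or [32].

```python
# Illustrative regional-difference saliency on a grid partition (assumed design choices).
import numpy as np

def regional_saliency(image, grid=8):
    """image: H x W x 3 float array; returns a grid x grid map of regional saliency."""
    h, w, _ = image.shape
    ys = np.linspace(0, h, grid + 1, dtype=int)
    xs = np.linspace(0, w, grid + 1, dtype=int)

    # Mean colour and normalised centre of each grid cell ("region")
    means, centres = [], []
    for i in range(grid):
        for j in range(grid):
            cell = image[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            means.append(cell.reshape(-1, 3).mean(axis=0))
            centres.append([(ys[i] + ys[i + 1]) / (2 * h), (xs[j] + xs[j + 1]) / (2 * w)])
    means, centres = np.array(means), np.array(centres)

    # A region is salient if its colour differs from the others,
    # with nearby regions weighted more strongly than distant ones.
    colour_diff = np.linalg.norm(means[:, None] - means[None, :], axis=2)
    spatial_w = np.exp(-np.linalg.norm(centres[:, None] - centres[None, :], axis=2) / 0.25)
    saliency = (colour_diff * spatial_w).sum(axis=1)
    return (saliency / (saliency.max() + 1e-8)).reshape(grid, grid)
```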
Recently, machine learning approaches have been introduced to model visual saliency by learning the top-down control from recorded eye-fixations or labeled salient regions. Typically, the top-down control works as a "stimulus-saliency" function that selects, re-weights and integrates the input visual stimuli. For example, Itti and Koch [23] proposed a supervised approach to learn the optimal weights for feature combination, while Peters and Itti [47] presented an approach to learn the projection matrix between global scene characteristics and eye density maps. Navalpakkam and Itti [44] modeled the top-down gain optimization as maximizing the signal-to-noise ratio (SNR); that is, they learned linear weights for feature combination by maximizing the ratio between target saliency and distractor saliency. Besides learning explicit fusion functions, Kienzle et al. [29] proposed a nonparametric approach to learn a visual saliency model from human eye-fixations on images. A support vector machine (SVM) was trained to determine saliency from local intensities. For video, Kienzle et al. [30] presented an approach to learn a set of temporal filters from eye-fixations so as to find interesting locations. On the regional saliency dataset, Liu et al. [33] proposed a set of novel features and adopted a conditional random field (CRF) to combine these features for salient object detection. After that, they extended the approach to detect salient object sequences in video [34].
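As a rough illustration of the fixation-driven learning idea, the sketch below trains an SVM to separate local intensity patches centered on recorded fixations from patches at random locations, in the spirit of Kienzle et al. [29]. The patch size, the negative-sampling scheme and the RBF kernel are assumptions made for the example, not the settings reported in that work.

```python
# Illustrative fixation-vs-random patch classifier (assumed patch size and sampling).
import numpy as np
from sklearn.svm import SVC

def extract_patch(gray, y, x, size=13):
    half = size // 2
    return gray[y - half:y + half + 1, x - half:x + half + 1].ravel()

def train_fixation_svm(gray_images, fixations, negatives_per_image=50, size=13):
    """fixations: one (y, x) array per image, recorded from human observers."""
    X, y, half = [], [], size // 2
    rng = np.random.default_rng(0)
    for gray, fixs in zip(gray_images, fixations):
        h, w = gray.shape
        # Positive samples: intensity patches centred on recorded fixations
        for fy, fx in fixs:
            if half <= fy < h - half and half <= fx < w - half:
                X.append(extract_patch(gray, fy, fx, size)); y.append(1)
        # Negative samples: patches at random image locations
        for _ in range(negatives_per_image):
            ry = rng.integers(half, h - half)
            rx = rng.integers(half, w - half)
            X.append(extract_patch(gray, ry, rx, size)); y.append(0)
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(np.array(X), np.array(y))
    return clf  # clf.predict_proba on a new patch gives its saliency score
```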