the explicit extraction of positive and negative samples by directly integrating both
the positive and unlabeled data into the optimization objective in a cost-sensitive
manner. In this process, we first recover the ground-truth saliency maps from the
limited eye fixations received by each frame. The basic principle is that visual subsets that are adjacent and similar to the eye fixations should be assigned high saliency values. Toward this end, the visual similarity map is calculated to pop out the locations that are similar to the positive samples (as shown in Fig. 4.6b), while the spatial correlation map is derived to pop out the neighbors of the eye fixations (as shown in Fig. 4.6c). Finally, the visual similarity map and the spatial correlation map are combined to derive the ground-truth saliency map (as shown in Fig. 4.6d). For the sake of convenience, the ground-truth saliency values are normalized into $[0, 1]$.
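As a rough illustration of this recovery step, the following sketch combines a visual similarity map and a spatial correlation map into a normalized ground-truth map. The chapter does not specify the exact similarity and correlation measures, so the Gaussian kernels, the bandwidths `sigma_feat` and `sigma_dist`, and the product-based fusion below are our assumptions.

```python
import numpy as np

def recover_ground_truth(features, positions, fix_idx,
                         sigma_feat=1.0, sigma_dist=0.1):
    """Recover a ground-truth saliency map from sparse eye fixations.

    features : (N, D) local visual attributes, one row per location
    positions: (N, 2) normalized (x, y) coordinates of the locations
    fix_idx  : indices of the locations that received eye fixations
    The kernels and the product-based fusion are illustrative
    assumptions, not the chapter's exact formulation.
    """
    # Visual similarity map: pop out locations that look like fixations
    feat_dist = np.linalg.norm(
        features[:, None, :] - features[fix_idx][None, :, :], axis=2)
    sim_map = np.exp(-feat_dist**2 / (2 * sigma_feat**2)).max(axis=1)

    # Spatial correlation map: pop out spatial neighbors of fixations
    sp_dist = np.linalg.norm(
        positions[:, None, :] - positions[fix_idx][None, :, :], axis=2)
    cor_map = np.exp(-sp_dist**2 / (2 * sigma_dist**2)).max(axis=1)

    # Combine both maps and normalize the result into [0, 1]
    g = sim_map * cor_map
    return (g - g.min()) / (g.max() - g.min() + 1e-12)
```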
With the ground-truth saliency values, we train a ranking function $\phi(\mathbf{x}) = \omega^{T}\mathbf{x}$ that can integrate various local visual attributes (represented by a column vector $\mathbf{x}$) for visual saliency estimation. For two locations $B_{km}$ and $B_{kn}$ with ground-truth saliency values $g_{km}$ and $g_{kn}$, $\phi(\mathbf{x}_{km}) > \phi(\mathbf{x}_{kn})$ indicates that $B_{km}$ ranks higher than $B_{kn}$ and maintains a higher saliency. In the training process, it is often difficult to directly determine the label for each training sample, especially for the one with medium ground-truth saliency (e.g., around 0.5). Therefore, we integrate all the positive and unlabeled data into a rank learning framework in a cost-sensitive manner. Toward this end, the empirical loss can be defined as:
$$\mathcal{L}(\omega) = \sum_{k} \sum_{m \neq n} \left[ g_{km} - g_{kn} \right]_{+} \left[ \omega^{T}\mathbf{x}_{km} - \omega^{T}\mathbf{x}_{kn} \leq 0 \right]_{1} \qquad (4.5)$$

where $[x]_{+} = \max(0, x)$, and $[E]_{1} = 1$ if event $E$ holds, otherwise $[E]_{1} = 0$.
We can see that there will be a loss if the ranking function gives predictions contrary to the ground-truth saliencies. Moreover, the loss emphasizes the pairs between targets and distractors, since the central issue in visual saliency estimation is to distinguish targets from distractors. That is, the cost of erroneously ranking a target-distractor pair (i.e., $g_{km} - g_{kn} \approx 1$) is much bigger than that of mistakenly predicting the ranks between target pairs or between distractor pairs (i.e., $g_{km} - g_{kn} \approx 0$). Thus our framework is cost-sensitive in that it differentiates target-distractor pairs from all other pairs.
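To make the loss concrete, here is a direct (unoptimized) evaluation of Eq. (4.5) for the locations of one frame; the function and variable names are ours.

```python
import numpy as np

def empirical_loss(w, X, g):
    """Cost-sensitive pairwise loss of Eq. (4.5) for one frame.

    w : (D,) weights of the linear ranking function phi(x) = w^T x
    X : (N, D) feature matrix, one row per location
    g : (N,) ground-truth saliency values in [0, 1]
    """
    scores = X @ w                     # phi(x) for every location
    loss = 0.0
    num = len(g)
    for m in range(num):
        for n in range(num):
            if m == n:
                continue
            weight = max(g[m] - g[n], 0.0)          # [g_km - g_kn]_+
            misranked = scores[m] - scores[n] <= 0  # binary term
            loss += weight * float(misranked)
    return loss
```

A target-distractor pair ($g$ difference near 1) contributes almost its full misranking cost, while pairs of two targets or two distractors ($g$ difference near 0) contribute almost nothing.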
Often, it is difficult to minimize such a loss with binary terms. Thus we simply replace each binary term with its upper bound (e.g., the exponential upper bound) and obtain a convex optimization objective. After that, the global optimum can be reached using a gradient-based method, and we can obtain the optimal ranking function.
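As a minimal sketch of this relaxation: the binary term $[\omega^{T}\mathbf{x}_{km} - \omega^{T}\mathbf{x}_{kn} \leq 0]_{1}$ is bounded above by $\exp(\omega^{T}\mathbf{x}_{kn} - \omega^{T}\mathbf{x}_{km})$, which is convex in $\omega$, so plain gradient descent reaches the global optimum. The learning rate and step count below are illustrative choices, not values from the chapter.

```python
import numpy as np

def train_ranker(X, g, lr=0.01, steps=500):
    """Minimize the convex exponential surrogate of Eq. (4.5).

    Each binary term is replaced by exp(w^T x_kn - w^T x_km);
    lr and steps are illustrative, not the chapter's settings.
    """
    w = np.zeros(X.shape[1])
    # Pair weights [g_km - g_kn]_+, fixed throughout training
    diff_g = np.maximum(g[:, None] - g[None, :], 0.0)   # (N, N)
    for _ in range(steps):
        scores = X @ w
        # coef[m, n] = [g_m - g_n]_+ * exp(w^T x_n - w^T x_m)
        coef = diff_g * np.exp(scores[None, :] - scores[:, None])
        # Gradient of sum_{m,n} coef[m, n] with respect to w is
        # sum_{m,n} coef[m, n] * (x_n - x_m)
        grad = coef.sum(axis=0) @ X - coef.sum(axis=1) @ X
        w -= lr * grad
    return w
```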
Experimental results show that our approach outperforms several state-of-the-art bottom-up (e.g., [13, 15, 16, 19, 20, 24, 68]) and top-down (e.g., [29, 44, 46]) approaches in visual saliency estimation. On the prevalent video eye-fixation dataset provided by Itti [22], our approach can reach an ROC score of 0.774. Some representative examples are illustrated in Fig. 4.7. From Fig. 4.7, we can see that our approach can effectively and accurately locate entire salient objects in various scenes.