Digital Signal Processing Reference
the explicit extraction of positive and negative samples by directly integrating both
the positive and unlabeled data into the optimization objective in a cost-sensitive
manner. In this process, we first recover the ground-truth saliency maps from the
limited eye fixations received by each frame. The basic principle is that visual
subsets that are adjacent and similar to the eye fixations should be assigned high
saliency values. Toward this end, the visual similarity map is calculated to pop out
the locations that are similar to the positive samples (as shown in Fig. 4.6b), while
the spatial correlation map is derived to pop out the neighbors of the eye fixations (as
shown in Fig. 4.6c). Finally, the visual similarity map and the spatial correlation map
are combined to derive the ground-truth saliency map (as shown in Fig. 4.6d). For
the sake of convenience, the ground-truth saliency values are normalized into [0, 1].
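The recovery step above can be illustrated with a minimal sketch. The text does not give the kernels or the combination rule, so everything specific here is an assumption made for illustration: Gaussian kernels for both maps, the bandwidths `sigma_f` and `sigma_s`, an element-wise product to combine the maps, and min-max normalization into [0, 1]. The function name `ground_truth_saliency` is hypothetical.

```python
import math


def ground_truth_saliency(features, fixations, sigma_f=1.0, sigma_s=1.0):
    """Sketch of ground-truth recovery from sparse eye fixations.

    features:  dict mapping a location (row, col) to its feature vector.
    fixations: list of locations that received eye fixations.
    Returns a dict mapping each location to a saliency value in [0, 1].
    """
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    raw = {}
    for loc, feat in features.items():
        # Visual similarity map: closeness in feature space to the most
        # similar fixated location (assumed Gaussian kernel).
        sim = max(math.exp(-sqdist(feat, features[f]) / (2 * sigma_f ** 2))
                  for f in fixations)
        # Spatial correlation map: closeness on the image plane to the
        # nearest fixation (assumed Gaussian kernel).
        spa = max(math.exp(-sqdist(loc, f) / (2 * sigma_s ** 2))
                  for f in fixations)
        # Assumed combination rule: element-wise product of the two maps.
        raw[loc] = sim * spa

    # Min-max normalization into [0, 1].
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    return {loc: (v - lo) / span if span > 0 else 0.0
            for loc, v in raw.items()}
```

With this choice, a location that is both near a fixation and visually similar to it receives a high value, while a location that satisfies only one of the two conditions is suppressed by the product.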
With the ground-truth saliency values, we train a ranking function $\phi(x) = \omega^T x$
that can integrate various local visual attributes (represented by a column vector $x$)
for visual saliency estimation. For two locations $B_{km}$ and $B_{kn}$ with ground-truth
saliency values $g_{km}$ and $g_{kn}$, $\phi(x_{km}) > \phi(x_{kn})$ indicates that
$B_{km}$ ranks higher than $B_{kn}$
and maintains a higher saliency. In the training process, it is often difficult to
directly determine the label for each training sample, especially for those with
medium ground-truth saliency (e.g., around 0.5). Therefore, we integrate all the
positive and unlabeled data into a rank learning framework in a cost-sensitive manner.
Toward this end, the empirical loss can be defined as:
$$L(\omega) = \sum_{k} \sum_{m \neq n} [g_{km} - g_{kn}]_+ \, [\omega^T x_{km} \le \omega^T x_{kn}]_1 \qquad (4.5)$$
where $[x]_+ = \max(0, x)$. Note that here $[E]_1 = 1$ if event $E$ holds, otherwise
$[E]_1 = 0$.
We can see that there will be a loss if the ranking function gives predictions contrary
to the ground-truth saliencies. Moreover, the loss emphasizes the correlations between
targets and distractors, since the central issue in visual saliency estimation is to
distinguish targets from distractors. That is, the cost of erroneously ranking a
target-distractor pair (i.e., $g_{km} - g_{kn} \to 1$) is much bigger than that of
mistakenly predicting the ranks between target pairs or between distractor pairs
(i.e., $g_{km} - g_{kn} \to 0$). Thus our framework is cost-sensitive in that it
differentiates target-distractor pairs.
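To make the cost-sensitive behavior of the loss in (4.5) concrete, the following sketch evaluates it directly on a toy frame. The data layout (a list of frames, each a list of `(x, g)` pairs) and the function name `empirical_loss` are assumptions made for illustration only.

```python
def empirical_loss(omega, frames):
    """Evaluate the pairwise ranking loss of Eq. (4.5).

    omega:  weight vector of the linear ranking function phi(x) = omega^T x.
    frames: list of frames; each frame is a list of (x, g) pairs, where x is
            the attribute vector of a location and g its ground-truth
            saliency in [0, 1]. This layout is an illustrative assumption.
    """
    def score(x):
        return sum(w * xi for w, xi in zip(omega, x))

    loss = 0.0
    for frame in frames:                          # sum over frames k
        for m, (x_m, g_m) in enumerate(frame):
            for n, (x_n, g_n) in enumerate(frame):
                if m == n:
                    continue                      # only pairs with m != n
                weight = max(0.0, g_m - g_n)      # [g_km - g_kn]_+
                if score(x_m) <= score(x_n):      # [.]_1 indicator is 1
                    loss += weight
    return loss


# Toy frame: a target (g = 1.0), a distractor (g = 0.0), a second target (g = 0.9).
frames = [[([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([0.5, 0.5], 0.9)]]
good = empirical_loss([1.0, 0.0], frames)   # ranks every pair correctly
bad = empirical_loss([0.0, 1.0], frames)    # reverses every pair
```

In the reversed case the two target-distractor pairs contribute 1.0 and 0.9, while the target-target pair contributes only about 0.1, which is the cost sensitivity described above.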
Often, it is difficult to minimize such a loss with binary terms. Thus we simply
replace each binary term with its upper bound (e.g., an exponential upper bound)
and obtain a convex optimization objective. After that, the global optimum can
be reached using a gradient-based method, and we can obtain the optimal ranking
function.
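As an illustration of this relaxation, the sketch below uses the exponential bound $[\omega^T x_{km} \le \omega^T x_{kn}]_1 \le \exp(-\omega^T (x_{km} - x_{kn}))$ and minimizes the resulting convex surrogate with plain gradient descent. The fixed step size, the iteration count, the zero initialization, and the function names are assumptions; the text does not specify which gradient-based method or stopping rule is used.

```python
import math


def surrogate_and_grad(omega, frames):
    """Exponential upper bound of Eq. (4.5) and its gradient w.r.t. omega.

    frames: list of frames, each a list of (x, g) pairs (assumed layout).
    """
    dim = len(omega)
    value, grad = 0.0, [0.0] * dim
    for frame in frames:
        for m, (x_m, g_m) in enumerate(frame):
            for n, (x_n, g_n) in enumerate(frame):
                weight = max(0.0, g_m - g_n)          # [g_km - g_kn]_+
                if m == n or weight == 0.0:
                    continue
                d = [a - b for a, b in zip(x_m, x_n)]  # x_km - x_kn
                margin = sum(w * di for w, di in zip(omega, d))
                e = weight * math.exp(-margin)         # bound on the binary term
                value += e
                for i in range(dim):
                    grad[i] -= e * d[i]                # d/d omega of e is -e * d
    return value, grad


def train(frames, dim, step=0.1, iters=200):
    """Plain gradient descent with a fixed step (illustrative settings)."""
    omega = [0.0] * dim
    for _ in range(iters):
        _, grad = surrogate_and_grad(omega, frames)
        omega = [w - step * g for w, g in zip(omega, grad)]
    return omega
```

Because each term is the exponential of a linear function of $\omega$ scaled by a nonnegative weight, the surrogate is convex, so gradient steps with a suitably small step size do not get trapped in local minima. In practice a regularizer or an early stopping rule would likely be added, since on perfectly rankable data the surrogate keeps decreasing as the norm of $\omega$ grows.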
Experimental results show that our approach outperforms several state-of-the-art
bottom-up (e.g., [13, 15, 16, 19, 20, 24, 68]) and top-down (e.g., [29, 44, 46])
approaches in visual saliency estimation. On the prevalent video eye-fixation dataset
provided by Itti [22], our approach can reach an ROC score of 0.774. Some
representative examples are illustrated in Fig. 4.7. From Fig. 4.7, we can see that our
approach can effectively and accurately locate the entire salient objects in various
scenes.