Video Scene Analysis: A Machine Learning Perspective - Video Segmentation and Its Applications


the explicit extraction of positive and negative samples by directly integrating both

the positive and unlabeled data into the optimization objective in a cost-sensitive

manner. In this process, we first recover the ground-truth saliency maps from the

limited eye fixations received by each frame. The basic principle is that visual subsets that are adjacent and similar to the eye fixations should be assigned high saliency values. Toward this end, the visual similarity map is calculated to pop out the locations that are similar to the positive samples (as shown in Fig. 4.6 b), while the spatial correlation map is derived to pop out the neighbors of the eye fixations (as shown in Fig. 4.6 c). Finally, the visual similarity map and the spatial correlation map are combined to derive the ground-truth saliency map (as shown in Fig. 4.6 d). For the sake of convenience, the ground-truth saliency values are normalized into [0, 1].
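The recovery step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how the two maps are computed or combined, so the Gaussian kernels, the bandwidths `sigma_f` and `sigma_s`, the pixel-wise product, and the function name `recover_saliency` are all hypothetical choices made for this sketch.

```python
import math

def recover_saliency(features, positions, fixated, sigma_f=1.0, sigma_s=1.0):
    """Hypothetical sketch: recover saliency values in [0, 1] for each location
    from a few fixated locations. Kernels and the product rule are assumptions."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    raw = []
    for feat, pos in zip(features, positions):
        # Visual similarity map: best Gaussian similarity to any fixated location.
        sim = max(math.exp(-sq_dist(feat, features[j]) / (2 * sigma_f ** 2))
                  for j in fixated)
        # Spatial correlation map: best Gaussian proximity to any fixated location.
        near = max(math.exp(-sq_dist(pos, positions[j]) / (2 * sigma_s ** 2))
                   for j in fixated)
        raw.append(sim * near)  # assumed combination rule: product of the two maps

    lo, hi = min(raw), max(raw)
    if hi == lo:
        # Degenerate case: every location ties with the fixated ones.
        return [1.0 for _ in raw]
    # Min-max normalization into [0, 1].
    return [(r - lo) / (hi - lo) for r in raw]

# Three toy locations; location 0 received an eye fixation.
features = [[1.0], [0.9], [0.1]]
positions = [(0, 0), (0, 1), (5, 5)]
g = recover_saliency(features, positions, fixated=[0])
```

In this toy input, the location that is both similar and adjacent to the fixation receives an intermediate value, while the distant, dissimilar one is pushed to zero.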

With the ground-truth saliency values, we train a ranking function $\phi(x) = \omega^T x$ that can integrate various local visual attributes (represented by a column vector $x$) for visual saliency estimation. For two locations $B_{km}$ and $B_{kn}$ with ground-truth saliency values $g_{km}$ and $g_{kn}$, $\phi(x_{km}) > \phi(x_{kn})$ indicates that $B_{km}$ ranks higher than $B_{kn}$ and maintains a higher saliency. In the training process, it is often difficult to directly determine the label for each training sample, especially for one with medium ground-truth saliency (e.g., around 0.5). Therefore, we integrate all the positive and unlabeled data into a rank learning framework in a cost-sensitive manner. Toward this end, the empirical loss can be defined as:

$$\mathcal{L}(\omega) = \sum_k \sum_{m \neq n} [g_{km} - g_{kn}]_+ \, [\omega^T x_{km} \leq \omega^T x_{kn}]_1 \qquad (4.5)$$

where $[z]_+ = \max(0, z)$. Note that here $[E]_1 = 1$ if event $E$ holds, otherwise $[E]_1 = 0$.

We can see that there will be a loss if the ranking function gives predictions contrary to the ground-truth saliencies. Moreover, the loss emphasizes the correlations between targets and distractors, since the central issue in visual saliency estimation is to distinguish targets from distractors. That is, the cost of erroneously ranking a target-distractor pair (i.e., $g_{km} - g_{kn} \to 1$) is much bigger than that of mistakenly predicting the ranks between target pairs or between distractor pairs (i.e., $g_{km} - g_{kn} \to 0$). Thus the framework is cost-sensitive: it differentiates target-distractor pairs from the other pairs.
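As a minimal sketch of how the loss in (4.5) could be evaluated, the following Python function sums the cost $[g_{km} - g_{kn}]_+$ over every misranked ordered pair in every frame. The function name `empirical_loss`, the data layout (a list of `(X, g)` pairs, one per frame), and the toy values are assumptions made for illustration.

```python
def empirical_loss(omega, frames):
    """Evaluate the pairwise loss of Eq. (4.5).

    frames: list of (X, g), where X is a list of feature vectors (one per
    location in the frame) and g the matching ground-truth saliencies.
    This layout is a hypothetical choice for the sketch.
    """
    def score(x):
        # Linear ranking function: omega^T x.
        return sum(w * v for w, v in zip(omega, x))

    loss = 0.0
    for X, g in frames:
        scores = [score(x) for x in X]
        for m in range(len(X)):
            for n in range(len(X)):
                if m == n:
                    continue
                cost = max(0.0, g[m] - g[n])        # [g_km - g_kn]_+
                misranked = scores[m] <= scores[n]  # [w^T x_km <= w^T x_kn]_1
                if misranked:
                    loss += cost
    return loss

# One toy frame with a target (0.9), a distractor (0.1), and a medium location (0.5).
frames = [([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.9, 0.1, 0.5])]
loss_good = empirical_loss([1.0, 0.0], frames)  # ranks agree with g
loss_bad = empirical_loss([0.0, 1.0], frames)   # ranks are reversed
```

With the reversed weights, the target-distractor pair contributes 0.8 to the loss while each pair involving the medium location contributes only 0.4, which is the cost-sensitive behavior described above.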

Often, it is difficult to minimize such a loss with binary terms. Thus we simply replace each binary term with its upper bound (e.g., an exponential upper bound) and obtain a convex optimization objective. After that, the global optimum can be reached using a gradient-based method, and we obtain the optimal ranking function.
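One way to read this relaxation: since $[\omega^T x_{km} \leq \omega^T x_{kn}]_1 \leq \exp(-\omega^T (x_{km} - x_{kn}))$, each binary term can be replaced by the exponential, which is convex in $\omega$. The sketch below minimizes that surrogate with plain gradient descent. The function names, the step size `lr`, the fixed iteration count, and the toy data are assumptions for illustration; the text does not specify the optimizer settings.

```python
import math

def surrogate_and_grad(omega, frames):
    """Exponential upper bound of the pairwise loss and its gradient."""
    loss = 0.0
    grad = [0.0] * len(omega)
    for X, g in frames:
        for m, xm in enumerate(X):
            for n, xn in enumerate(X):
                cost = max(0.0, g[m] - g[n])  # [g_km - g_kn]_+
                if m == n or cost == 0.0:
                    continue
                diff = [a - b for a, b in zip(xm, xn)]           # x_km - x_kn
                margin = sum(w * d for w, d in zip(omega, diff))  # omega^T diff
                term = cost * math.exp(-margin)  # upper bound of the binary term
                loss += term
                for i, d in enumerate(diff):
                    grad[i] -= term * d          # d/d omega of cost * exp(-margin)
    return loss, grad

def fit(frames, dim, lr=0.1, steps=200):
    """Plain gradient descent with a fixed step size (assumed settings)."""
    omega = [0.0] * dim
    for _ in range(steps):
        _, grad = surrogate_and_grad(omega, frames)
        omega = [w - lr * gi for w, gi in zip(omega, grad)]
    return omega

# Toy frame: a target (0.9), a distractor (0.1), and a medium location (0.5).
frames = [([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.9, 0.1, 0.5])]
loss_start, _ = surrogate_and_grad([0.0, 0.0], frames)
omega = fit(frames, dim=2)
loss_end, _ = surrogate_and_grad(omega, frames)
```

On this toy frame the learned weights rank the target above the medium location and the medium location above the distractor. Because the pairs here are perfectly separable, the surrogate keeps decreasing as the weights grow, so the sketch stops after a fixed number of steps rather than testing for convergence.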

Experimental results show that our approach outperforms several state-of-the-art bottom-up (e.g., [ 13 , 15 , 16 , 19 , 20 , 24 , 68 ]) and top-down (e.g., [ 29 , 44 , 46 ]) approaches in visual saliency estimation. On the prevalent video eye-fixation dataset provided by Itti [ 22 ], our approach can reach an ROC score of 0.774. Some representative examples are illustrated in Fig. 4.7. From Fig. 4.7, we can see that our approach can effectively and accurately locate entire salient objects in various scenes.

