Thus, the video interval $V$ can be described by a set of descriptors $D_V = \{(x_1, f_1), \cdots, (x_m, f_m), \cdots, (x_M, f_M)\}$. The indexing of each video frame is obtained by a vector quantization process. Specifically, let

$$C = \{\, (g_r, l_r) \mid g_r \in \mathbb{R}^P,\ l_r \in \{1, 2, \ldots, R\},\ r = 1, 2, \ldots, R \,\}$$

be a set of templates (or codewords) $g_r$, where $l_r$ is the label of the $r$-th template. This set is previously generated and optimized by a competitive learning algorithm [331] (illustrated in Table 3.7). The vector $x_m$ is mapped by $\mathbb{R}^P \to C$ on the Voronoi space, i.e., quantizing the input vector by:
$$Q(x_m) = \{\, l^{x_m}_{r},\ l^{x_m}_{r,1},\ \ldots,\ l^{x_m}_{r,(\omega-1)} \,\} \qquad (3.30)$$
where $Q$ is the vector quantization function, $l^{x_m}_{r}$ is the label of the closest template, i.e.,

$$r = \arg\min_{r} \| x_m - g_r \| \qquad (3.31)$$

and $l^{x_m}_{r,1}$ and $l^{x_m}_{r,(\omega-1)}$ are the labels of the first and the last neighbors of the winning template $g_r$, respectively.
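Equations (3.30) and (3.31) can be sketched in a few lines of code. The following is a minimal illustration, not the book's implementation; it assumes the $\omega - 1$ "neighbors" of the winning template are its next-best matches under Euclidean distance, as the surrounding text suggests, and the names (`multi_label_quantize`, `templates`, `omega`) are ours.

```python
import numpy as np

def multi_label_quantize(x_m, templates, omega):
    """Map a feature vector x_m to the labels of its omega closest
    templates (codewords), per Eqs. (3.30)-(3.31).

    x_m       : (P,) feature vector of one video frame
    templates : (R, P) matrix whose r-th row is codeword g_r
    omega     : number of labels to return (winner + omega-1 neighbors)
    """
    # Eq. (3.31): distance to every template; the winner is the arg-min.
    distances = np.linalg.norm(templates - x_m, axis=1)
    # Eq. (3.30): keep the winning label plus its omega-1 nearest
    # neighbors. Labels l_r are taken to be template indices 0..R-1 here.
    return np.argsort(distances)[:omega]
```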
Equation (3.30) obtains a multiple-label indexing that is designed to describe correlation information between the winning template and its neighbors. Figure 3.12 shows an example of this indexing process. Here we are interested not only in the best-match template, but also in the second (and up to the $\omega$-th) best match. Once a cell is selected, the $\omega - 1$ neighbors which have not yet been visited in the scan are then also included in the output label set. This allows for interpretation of the correlation information between the selected cell and its neighbors. Since a video sequence usually has very strong frame-to-frame correlation [102] due to the nature of time-sequence data, embedding correlation information through Eq. (3.30) offers a better description of video contents, and thus a means for more accurate discriminant analysis. For example, two consecutive frames which are visually similar may not be mapped into the same cell; rather, they may be mapped onto two cells in a neighborhood area, so that mapping through multiple labels using Eq. (3.30) maps two frames from the same class in the visual space into the same neighborhood area in feature space.
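To make the neighborhood effect concrete, here is a small usage example that reuses the `multi_label_quantize` sketch above. The codebook and "frames" are synthetic values invented purely for illustration: two feature vectors that are close in feature space typically receive overlapping (though not necessarily identical) label sets.

```python
rng = np.random.default_rng(0)
templates = rng.normal(size=(16, 8))            # R = 16 codewords in R^8

frame_a = rng.normal(size=8)
frame_b = frame_a + 0.05 * rng.normal(size=8)   # a visually similar frame

labels_a = multi_label_quantize(frame_a, templates, omega=3)
labels_b = multi_label_quantize(frame_b, templates, omega=3)

# Even if the single best-match cells differ, the omega-label sets
# usually intersect, placing both frames in the same neighborhood.
print(set(labels_a) & set(labels_b))
```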
The visual content of the video frame $f_m$ is therefore characterized by the membership of the label set, $\{\, l^{x_m}_{r},\ l^{x_m}_{r,1},\ \ldots,\ l^{x_m}_{r,(\omega-1)} \,\}$. The results of mapping all frames, $\{\, l^{x_m}_{r},\ l^{x_m}_{r,1},\ \ldots,\ l^{x_m}_{r,(\omega-1)} \,\}$, $\forall m \in \{1, \ldots, M\}$, i.e., from the mapping of the entire video interval $V_j$, are concatenated into a vector $v_j = [w_{j1}, \ldots, w_{jr}, \ldots, w_{jR}]$. The weight parameters are calculated by the TF$\times$IDF weighting scheme [323]:
$$w_{jr} = \frac{F_{jr}}{\max_r F_{jr}} \times \log \frac{N}{n_r} \qquad (3.32)$$
where the weight parameter $F_{jr}$ stands for the raw frequency of template $g_r$ in the video interval $V_j$, i.e., the number of times the label $r$ occurs in the label sets obtained from the frames of $V_j$.
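A sketch of the weighting step in Eq. (3.32) follows. The excerpt above does not define $N$ and $n_r$, so this code assumes the standard TF$\times$IDF reading [323]: $N$ is the total number of video intervals in the database and $n_r$ is the number of intervals whose label sets contain template $r$. The function name and argument layout are ours.

```python
import numpy as np

def tfidf_weights(interval_labels, R, N, n):
    """Turn one interval's concatenated labels into the weight vector
    v_j = [w_j1, ..., w_jR] of Eq. (3.32).

    interval_labels : iterable of labels produced by quantizing every
                      frame of the interval V_j (omega labels per frame)
    R               : codebook size
    N               : total number of video intervals in the database
    n               : (R,) array; n[r] = number of intervals in which
                      template r occurs (assumed IDF denominator; every
                      entry must be > 0)
    """
    # F_jr: raw frequency of template r within the interval.
    F = np.bincount(np.asarray(interval_labels), minlength=R).astype(float)
    # Eq. (3.32): normalized term frequency times inverse document frequency.
    return F / F.max() * np.log(N / n)
```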