threshold, then it means that I_{ij} cannot be efficiently encoded with VC_i and hence most probably belongs to the next shot. The visual code-book VC_i is entropy-encoded and embedded in the code stream. All the B-frames of a given GOP are encoded applying a conventional motion estimation and compensation method.
In this approach, the visual code-book plays two roles: on the one hand, it is used for decoding the key frame; on the other hand, it is a low-level content descriptor.
Indeed, in their previous work [17] the authors showed that visual code-books are good low-level descriptors of the content. Since this pioneering work, visual code-books have been applied in specific video description spaces [18] and have become very popular in visual content indexing [19].
Assuming the visual content is quantized using a VC, the similarity between two images and/or two shots can be estimated by evaluating the distortion introduced when the roles of the two code-books are exchanged. The authors propose a symmetric form:
φ(S_i, S_j) = D_{VC_j}(S_i) / D_{VC_i}(S_i) + D_{VC_i}(S_j) / D_{VC_j}(S_j),   (7)
where D_{VC_j}(S_i) is the distortion of encoding a shot (or key-frame) S_i with the visual code-book VC_j.
The visual distortion is defined as in the usual case of VQ encoding:
D_{VC_j}(S_i) = (1/N_i) ∑_{p=1}^{N_i} ||v_p − c_j(v_p)||²,   (8)
Here v_p is the visual vector representing a key-frame pixel block, N_i is the number of blocks in the key-frame (or in the whole shot S_i), and c_j(v_p) is the vector in the code-book VC_j closest to v_p in the sense of Euclidean distance.
The symmetric form in Eq. (7) is a good measure of dissimilarity between shots. Hence the temporal scalability of the video index can be obtained by grouping shots on the basis of Eq. (7).
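The symmetric measure of Eq. (7) can be sketched directly on top of the distortion of Eq. (8) (again an illustrative NumPy sketch; the function names are assumptions):

```python
import numpy as np

def vq_distortion(blocks, codebook):
    # Eq. (8): mean squared error of nearest-code-word quantization.
    d2 = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def shot_dissimilarity(blocks_i, vc_i, blocks_j, vc_j):
    """Eq. (7): cross-encoding distortion normalized by the shot's
    own-code-book distortion, summed symmetrically. Assumes the
    own-code-book distortions are non-zero (lossy VQ)."""
    return (vq_distortion(blocks_i, vc_j) / vq_distortion(blocks_i, vc_i)
            + vq_distortion(blocks_j, vc_i) / vq_distortion(blocks_j, vc_j))
```

For two identical shots sharing one code-book the measure equals 2; it grows as the two code-books become less interchangeable, which is what makes thresholding it usable for grouping shots.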
Thus two scalable mid-level indexes are implicitly embedded in a scalable code-stream: two different code-books for subsequent groups of GOPs indicate a shot boundary, and a scene boundary can be obtained when parsing and grouping shots with Eq. (7).
On the other hand, decoding the visual code-books and representing key-frames only with code-words supplies the base level of spatial scalability. The enhancement levels can be obtained by scalable decoding of the VQ error on key-frames.
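The base-layer reconstruction amounts to tiling code-words: each transmitted block index is replaced by its code-word, and enhancement layers would then add the scalably decoded VQ residual. A minimal sketch, assuming a rectangular grid of block indices and flattened code-words (layout and names are ours):

```python
import numpy as np

def decode_base_layer(indices, codebook, block_h, block_w):
    """Base spatial layer: rebuild a key-frame by tiling code-words.

    indices  -- (rows, cols) array of code-word indices, one per block
    codebook -- (K, block_h * block_w) array of flattened code-words
    """
    rows, cols = indices.shape
    frame = np.empty((rows * block_h, cols * block_w))
    for r in range(rows):
        for c in range(cols):
            # Replace each block by its code-word c_j(v_p).
            cw = codebook[indices[r, c]].reshape(block_h, block_w)
            frame[r * block_h:(r + 1) * block_h,
                  c * block_w:(c + 1) * block_w] = cw
    return frame
```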
This scheme is very interesting for HD content: quick browsing does not require decoding the full HD stream, as the base layer alone can be used to visualize the frames. Reduced temporal resolution can also be achieved when parsing.
2.2.2 Object-Based Mid-level Features from (M)JPEG2000 Compressed Stream
Object-based indexing of compressed content remains one of the most difficult problems in the vast set of indexing tasks, be it Low Definition, Standard Definition