or High Definition video and film content. Mainstream research in indexing
and retrieval of video content nowadays avoids the complex, ill-posed "chicken-and-egg"
problem of extracting meaningful objects from video. It focuses instead on local
features such as the SIFT descriptors proposed by Lowe [20]. Hence, in the paper entitled
"Unsupervised Object Discovery: A Comparison" [21], where the authors search
for images containing objects, one can read "Images are represented using local features".
Pushing this reasoning to its extreme, we arrive at the famous cat illusion
and make a "bottom-up" effort in visual content understanding. At the same time,
the strong effort of the multimedia research community on the elaboration
of the MPEG-4, MPEG-7 [22] and JPEG2000 (Part 1) standards was devoted to the de-
velopment of automatic segmentation methods for video content to extract objects.
Here the approach is just the opposite: first an entity has to be extracted, and then
a description of it (sparse, dense, local or global) can be obtained. The results of
these methods, e.g. [23, 24, 25], while not always ensuring an ideal correspondence
of extracted object borders to visually observed contours, were sufficiently good for
fine-tuning encoding parameters and for content description.
Hence, we are strongly convinced that the paradigm of segmenting
objects first and then representing them in adequate feature spaces for object-based
indexing and retrieval of video remains a promising road to success and a
good alternative to local modeling of content by feature points. In the context of
scalable HD content, the object extraction process has to be adapted to the multiple
resolutions present in the code-stream. It has to supply mid-level, object-based features
corresponding to each resolution level.
In [26] we proposed a full solution for mid-level global feature extraction for
generic objects in (M)JPEG2000 compressed content, by an approach operating di-
rectly on the Daubechies 9/7 pyramid of an HD compressed stream. The underlying
assumptions of the method are as follows: i) generic objects can be
"discovered" in video when the magnitude of their local ego-motion sufficiently
differs from the global motion, that of the camera; ii) the high-frequency information
contained in the HF subbands at each level of the wavelet pyramid can be efficiently
reused for delimiting object boundaries; iii) both LF and HF subbands are necessary
to convey global object features.
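Assumption i) can be illustrated by a simple thresholding of the residual between a dense motion field and the estimated camera motion. The following is a minimal sketch, not the actual procedure of [26]: the dense flow field, the global-motion estimate and the threshold `tau` are all hypothetical inputs chosen for illustration.

```python
import numpy as np

def motion_mask(flow, global_flow, tau=1.5):
    """Flag pixels whose local motion deviates from the global (camera)
    motion by more than tau, as in assumption i).

    flow        : (H, W, 2) dense motion field, (dx, dy) per pixel
    global_flow : (2,) estimated camera motion for the frame
    tau         : magnitude threshold (illustrative value)
    """
    residual = flow - global_flow              # local ego-motion
    magnitude = np.linalg.norm(residual, axis=-1)
    return magnitude > tau                     # boolean motion mask M_t
```

In practice the global motion would come from a parametric (e.g. affine) motion estimator over the frame, and the threshold would be tuned to the noise level of the motion field.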
According to our indexing paradigm, the first step consists of the extraction of
objects from the compressed stream. The overall strategy follows the fruitful ideas of
cooperative motion-based and color-based spatio-temporal video object segmenta-
tion [11]. Here, the areas of local motion have to be identified in the video frames first.
They form the so-called motion masks M_t at the lowest resolution level (k = K − 1)
of the K-level Daubechies pyramid. Then a color-based segmentation of the low-fre-
quency LL_k subband has to be performed on the whole subband. Finally, the motion
masks and the segmentation map are merged by majority vote, resulting in object masks
O_t = {O_{t,i}}, i = 1..n(k), k = K − 1. Objects at the top of the pyramid, corresponding
to the lowest scalability level, are thus extracted.
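The majority-vote merge of the motion mask M_t with the color-based segmentation map can be sketched as follows. This is a simplified illustration assuming NumPy arrays, not the exact procedure of [26]: each color region is kept as part of an object mask when most of its pixels are flagged as moving, and the 0.5 vote threshold is our illustrative choice.

```python
import numpy as np

def merge_by_majority_vote(motion_mask, seg_map):
    """Merge a boolean motion mask with a labeled color segmentation
    map by majority vote: a segmentation region joins the object mask
    when the majority of its pixels are flagged as moving.
    """
    object_mask = np.zeros_like(motion_mask, dtype=bool)
    for label in np.unique(seg_map):
        region = seg_map == label
        # majority vote over the pixels of this color region
        if motion_mask[region].mean() > 0.5:
            object_mask |= region
    return object_mask
```

Connected components of the resulting mask would then yield the individual object masks O_{t,i}.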
The object masks obtained are then projected onto the higher resolution levels us-
ing the wavelet location principle (see Figure 5), allowing for establishing direct