or High Definition video and film content. Mainstream research in indexing
and retrieval of video content nowadays avoids the complex, ill-posed "chicken-and-egg"
problem of extracting meaningful objects from video. It focuses instead on local
features such as the SIFT descriptors proposed by Lowe [20]. Hence, in the paper entitled
"Unsupervised Object Discovery: A Comparison" [21], where the authors search
for images containing objects, one can read "Images are represented using local features".
Pushing this reasoning to its extreme, we arrive at the famous cat illusion
and make a "bottom-up" effort in visual content understanding. At the same time,
the strong effort of the multimedia research community on the elaboration
of the MPEG-4, MPEG-7 [22] and JPEG2000 (Part 1) standards was devoted to the de-
velopment of automatic segmentation methods for video content to extract objects.
Here the approach is just the opposite: first an entity has to be extracted, and then
a description of it (sparse, dense, local or global) can be obtained. The results of
these methods, e.g. [23, 24, 25], while not always ensuring an ideal correspondence
of extracted object borders to visually observed contours, were sufficiently good for
fine-tuning encoding parameters and for content description.
Hence, we are strongly convinced that the paradigm of segmenting
objects first and then representing them in adequate feature spaces for object-based
indexing and retrieval of video remains a promising road to success and a
good alternative to local modeling of content by feature points. In the context of
scalable HD content, the object extraction process has to be adapted to the multiple
resolutions present in the code-stream. It has to supply mid-level, object-based features
corresponding to each resolution level.
In [26] we proposed a full solution for mid-level global feature extraction for
generic objects in (M)JPEG2000 compressed content, by an approach operating di-
rectly on the Daubechies 9/7 pyramid of an HD compressed stream. The underlying
assumptions of the method are as follows: i) generic objects can be
"discovered" in video when the magnitude of their local ego-motion sufficiently
differs from the global motion, that of the camera; ii) the high-frequency information
contained in the HF subbands at each level of the wavelet pyramid can be efficiently
reused for delimiting object boundaries; iii) both LF and HF subbands are necessary
to convey global object features.
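Assumption i) can be illustrated by a simple thresholding of the residual between a dense motion field and the estimated camera motion. The following is a minimal sketch, not the actual procedure of [26]: the dense flow field, the global-motion estimate and the threshold `tau` are all hypothetical inputs chosen for illustration.

```python
import numpy as np

def motion_mask(flow, global_flow, tau=1.5):
    """Flag pixels whose local motion deviates from the global (camera)
    motion by more than tau, as in assumption i).

    flow        : (H, W, 2) dense motion field, (dx, dy) per pixel
    global_flow : (2,) estimated camera motion for the frame
    tau         : magnitude threshold (illustrative value)
    """
    residual = flow - global_flow              # local ego-motion
    magnitude = np.linalg.norm(residual, axis=-1)
    return magnitude > tau                     # boolean motion mask M_t
```

In practice the global motion would come from a parametric (e.g. affine) motion estimator over the frame, and the threshold would be tuned to the noise level of the motion field.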
According to our indexing paradigm, the first step consists of the extraction of
objects from the compressed stream. The overall strategy follows the fruitful ideas of
cooperative motion-based and color-based spatio-temporal video object segmenta-
tion [11]. Here, the areas of local motion have to be identified in the video frames first.
They form the so-called motion masks M_t at the lowest resolution level (k = K − 1)
of the K-level Daubechies pyramid. Then a color-based segmentation of the low-fre-
quency LL_k subband has to be performed on the whole subband. Finally, the motion
masks and the segmentation map are merged by majority vote, resulting in object masks
O_t = {O_{t,i}}, i = 1..n(k), k = K − 1. Objects at the top of the pyramid, corresponding
to the lowest scalability level, are thus extracted.
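The majority-vote merge of the motion mask M_t with the color-based segmentation map can be sketched as follows. This is a simplified illustration assuming NumPy arrays, not the exact procedure of [26]: each color region is kept as part of an object mask when most of its pixels are flagged as moving, and the 0.5 vote threshold is our illustrative choice.

```python
import numpy as np

def merge_by_majority_vote(motion_mask, seg_map):
    """Merge a boolean motion mask with a labeled color segmentation
    map by majority vote: a segmentation region joins the object mask
    when the majority of its pixels are flagged as moving.
    """
    object_mask = np.zeros_like(motion_mask, dtype=bool)
    for label in np.unique(seg_map):
        region = seg_map == label
        # majority vote over the pixels of this color region
        if motion_mask[region].mean() > 0.5:
            object_mask |= region
    return object_mask
```

Connected components of the resulting mask would then yield the individual object masks O_{t,i}.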
The object masks obtained are then projected onto the higher resolution levels us-
ing the wavelet location principle (see Figure 5), allowing for establishing direct