threshold, then it means that I_{ij} cannot be efficiently encoded with VC_i and hence most probably belongs to the next shot. The visual code-book VC_i is entropy-encoded and embedded in the code stream. All the B-frames of a given GOP are encoded applying a conventional motion estimation and compensation method.
In this approach, the visual code-book plays two roles: on the one hand, it is used for decoding the key frame; on the other hand, it is a low-level content descriptor.
Indeed, in their previous work [17] the authors showed that visual code-books are good low-level descriptors of the content. Since this pioneering work, visual code-books have been applied in specific video description spaces [18] and have become very popular in visual content indexing [19].
Assuming the visual content is quantized using a VC, the similarity between two images and/or two shots can be estimated by evaluating the distortion introduced when the roles of the two code-books are exchanged. The authors propose a symmetric form:
φ(S_i, S_j) = D_{VC_j}(S_i) / D_{VC_i}(S_i) + D_{VC_i}(S_j) / D_{VC_j}(S_j),   (7)
where D_{VC_j}(S_i) is the distortion of encoding a shot (or key-frame) S_i with the visual code-book VC_j.
The visual distortion is defined as in the usual case of VQ encoding:
D_{VC_j}(S_i) = (1/N_i) ∑_{p=1}^{N_i} ||v_p − c_j(v_p)||²,   (8)
Here v_p is the visual vector representing a key-frame pixel block, N_i is the number of blocks in the key-frame (or in the whole shot S_i), and c_j(v_p) is the vector in the code-book VC_j closest to v_p in the sense of Euclidean distance.
The symmetric form in Eq. (7) is a good measure of dissimilarity between shots. Hence the temporal scalability of the video index can be obtained by grouping shots on the basis of Eq. (7).
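The symmetric measure of Eq. (7) can be sketched directly on top of the distortion of Eq. (8) (again an illustrative NumPy sketch; the function names are assumptions):

```python
import numpy as np

def vq_distortion(blocks, codebook):
    # Eq. (8): mean squared error of nearest-code-word quantization.
    d2 = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def shot_dissimilarity(blocks_i, vc_i, blocks_j, vc_j):
    """Eq. (7): cross-encoding distortion normalized by the shot's
    own-code-book distortion, summed symmetrically. Assumes the
    own-code-book distortions are non-zero (lossy VQ)."""
    return (vq_distortion(blocks_i, vc_j) / vq_distortion(blocks_i, vc_i)
            + vq_distortion(blocks_j, vc_i) / vq_distortion(blocks_j, vc_j))
```

For two identical shots sharing one code-book the measure equals 2; it grows as the two code-books become less interchangeable, which is what makes thresholding it usable for grouping shots.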
Thus two scalable mid-level indexes are implicitly embedded in a scalable code-stream: two different code-books for subsequent groups of GOPs indicate a shot boundary, and a scene boundary can be obtained when parsing and grouping shots with Eq. (7).
On the other hand, decoding the visual code-books and representing key-frames only with code-words supplies the base level of spatial scalability. The enhancement levels can be obtained by scalable decoding of the VQ error on key-frames.
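The base-layer reconstruction amounts to tiling code-words: each transmitted block index is replaced by its code-word, and enhancement layers would then add the scalably decoded VQ residual. A minimal sketch, assuming a rectangular grid of block indices and flattened code-words (layout and names are ours):

```python
import numpy as np

def decode_base_layer(indices, codebook, block_h, block_w):
    """Base spatial layer: rebuild a key-frame by tiling code-words.

    indices  -- (rows, cols) array of code-word indices, one per block
    codebook -- (K, block_h * block_w) array of flattened code-words
    """
    rows, cols = indices.shape
    frame = np.empty((rows * block_h, cols * block_w))
    for r in range(rows):
        for c in range(cols):
            # Replace each block by its code-word c_j(v_p).
            cw = codebook[indices[r, c]].reshape(block_h, block_w)
            frame[r * block_h:(r + 1) * block_h,
                  c * block_w:(c + 1) * block_w] = cw
    return frame
```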
This scheme is very interesting for HD content: quick browsing does not require decoding the full HD stream, as the base layer alone can be used to visualize the frames. Reduced temporal resolution can also be achieved when parsing.
2.2.2 Object-Based Mid-level Features from (M)JPEG2000 Compressed Stream
Object-based indexing of compressed content remains one of the most difficult problems in the vast set of indexing tasks, be it Low Definition, Standard Definition