Digital Signal Processing Reference
In-Depth Information
Fig. 4.2 An illustration of two research challenges in video content analysis and retrieval
Generally speaking, the main challenge of video content analysis is understanding
media by bridging the semantic gap between the bit stream on the one hand and the
visual content interpretation by humans on the other. Clearly, the semantic gap is a
fundamental problem in multimedia analysis and indexing that almost all research
papers in the field must address [50]. As a computational problem, the semantic gap
is tightly related to the modeling and analysis of video scenes. As pointed out by
Hare et al. [14], the semantic gap can be further divided into two major sections:
the gap between the low-level descriptors and object labels, and the gap between
the labeled objects and the full semantics. For reasons that will become clear later,
we refer to them respectively as the which-what gap and the where-how gap. It
should be noted that, although captured by these two designations, the full connotations
of the two sections of the semantic gap are much broader. In Sect. 4.3.2, we
will present a general framework for capturing various facets of scene semantics by
taking both temporal and spatial contexts into account.
Another research challenge that has received much research attention in the
multimedia retrieval field is the gap between users' search intents and their queries,
known as the intention gap [67]. Because keyword queries often fail to express
users' intents, the intention gap frequently leads to unsatisfying search results. Although
it originated in multimedia retrieval, this gap also arises in a broader range of multimedia
applications such as user-targeted video advertising and content-based filtering. For
example, a less intrusive advertising model displays product information
only when the user chooses to click on an object in a video. Since
it is the user who requests the product information, this type of advertising is better
targeted and likely to be more effective. By learning a user's visual attention patterns,
the hot-spots that correspond to brands can be further highlighted so as to attract
more of the user's attention.
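To make the idea of locating attention-grabbing hot-spots concrete, the following is a minimal, hypothetical sketch of a per-pixel saliency map computed with a simple center-surround heuristic (each pixel's contrast against its local neighborhood mean). The function name, the kernel size, and the heuristic itself are illustrative assumptions, not the method of any work cited in this chapter:

```python
import numpy as np

def saliency_map(img, k=3):
    """Toy center-surround saliency: |pixel - local k x k mean|,
    normalised to [0, 1]. Purely illustrative."""
    img = img.astype(float)
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    # Local mean built from the k*k shifted views of the padded image
    # (adequate for small k; a real system would use a fast filter).
    local = np.mean(
        [padded[i:i + h, j:j + w] for i in range(k) for j in range(k)],
        axis=0,
    )
    sal = np.abs(img - local)
    rng = sal.max() - sal.min()
    return (sal - sal.min()) / rng if rng > 0 else sal

# A single bright pixel on a flat background is maximally salient,
# so it would be picked as the hot-spot candidate.
frame = np.zeros((9, 9))
frame[4, 4] = 1.0
sal = saliency_map(frame, k=3)
hot_spot = np.unravel_index(np.argmax(sal), sal.shape)
```

In an advertising scenario, such a map could be thresholded to propose clickable regions; production systems instead use learned attention models, which this sketch only stands in for.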
Often, visual attention is operationalized as a selection mechanism that filters out
unwanted information in a scene [21]. By focusing on the most attractive region, a scene
can be analyzed in a user-targeted manner. Generally, determining which region will