extract relevant descriptions. The role of visual content in machine-driven labeling has long been investigated and has resulted in a variety of content-based image and video retrieval systems. Such systems commonly depend on low-level visual and spatiotemporal features and are based on the query-by-example paradigm. As a consequence, they are not effective if proper examples are unavailable. Furthermore, similarities in terms of low-level features do not easily translate into the high-level, semantic similarity expected by users.
Concept-based video retrieval [50] tries to bridge this semantic gap and has evolved over the last decade into a promising research field. It enables textual queries to be carried out on multimedia databases by substituting manual indexing with automatic detectors that mine media collections for semantic (visual) concepts. This approach has proven effective and, when a large set of concept detectors is available, its performance can be comparable with that of standard Web search [51]. Concept detection relies on machine learning techniques and, to be effective, requires vast training sets for building large-scale concept dictionaries and semantic relations. So far, the standard approach has been to employ manually labeled training examples provided by experts for concept learning. This solution is costly and gives rise to additional inconveniences: the number of learned concepts is limited, the insufficient scale of training data causes overfitting, and adapting to changes (such as new concepts of interest) remains difficult.
In [52], the huge video repository offered by YouTube is utilized as a novel kind of knowledge base for the machine interpretation of multimedia data. Web videos are exploited for two distinct purposes. On the one hand, the video clips returned by a YouTube search for a given concept are employed as positive examples to train the corresponding detector, while negative examples are drawn from other videos not tagged with that concept. Frames are sampled from the videos and their visual descriptors are fed to several statistical classifiers (Support Vector Machines, Passive-Aggressive Online Learning, Maximum Entropy), whose performance is compared. On the other hand, tag co-occurrences in video annotations are used to link concepts: for each concept, a bag-of-words representation is extracted from the tags of the associated video clips. The process is then repeated for user queries, which are thus mapped to the best matching learned concepts.
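To make the first of these two uses concrete, the following Python sketch shows how a concept detector could be trained from weakly labeled web video frames; the classifier choices mirror those compared in [52], but the descriptor dimensionality, the synthetic data, and the function name are illustrative assumptions rather than the actual implementation.

```python
# Minimal sketch of weakly supervised concept detector training, assuming
# frame descriptors have already been extracted from web videos.  The data
# below is synthetic; in practice the positives would come from videos
# returned by a tag/keyword search for the concept, the negatives from
# videos not tagged with it.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import PassiveAggressiveClassifier

def train_concept_detector(pos_descriptors, neg_descriptors, online=False):
    """Train a binary detector for one concept from weakly labeled frames."""
    X = np.vstack([pos_descriptors, neg_descriptors])
    y = np.concatenate([np.ones(len(pos_descriptors)),
                        np.zeros(len(neg_descriptors))])
    # Two of the statistical classifiers compared in [52]: a Support Vector
    # Machine and an online passive-aggressive learner.
    clf = PassiveAggressiveClassifier(max_iter=1000) if online else LinearSVC()
    clf.fit(X, y)
    return clf

# Usage with synthetic 128-dimensional descriptors standing in for real
# visual features (e.g., color/texture histograms or visual words).
rng = np.random.default_rng(0)
detector = train_concept_detector(rng.normal(1.0, 1.0, (200, 128)),
                                  rng.normal(0.0, 1.0, (300, 128)))
scores = detector.decision_function(rng.normal(0.5, 1.0, (10, 128)))
```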
The approach has been evaluated on a large dataset (1,200 hours of video) with 233 manually selected concepts. Precision in detecting concepts increases rapidly with the number of video clips included in the training set and stabilizes when 100-150 videos are used. Results show that the average precision achieved, although largely dependent on the concept, is promising (32.2%), suggesting that Web-based video collections indeed have the potential to support unsupervised visual and semantic learning.
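The tag-based mapping from user queries to learned concepts described above can likewise be sketched as follows; the concept vocabulary and tag strings are invented for illustration, and the TF-IDF weighting with cosine similarity is one plausible realization of the bag-of-words matching, not necessarily the exact scheme used in [52].

```python
# Minimal sketch of linking free-text queries to learned concepts: each
# concept is represented by a bag of words built from the tags of its
# associated video clips, and a query is mapped to the concepts whose tag
# profiles it matches best.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical concept-to-tags table; in practice the tags would be
# aggregated from the annotations of the videos retrieved per concept.
concept_tags = {
    "beach":   "sea sand surf waves sun holiday ocean",
    "soccer":  "football goal match stadium league players",
    "cooking": "recipe kitchen chef food baking dinner",
}

concepts = list(concept_tags)
vectorizer = TfidfVectorizer()
concept_matrix = vectorizer.fit_transform(concept_tags[c] for c in concepts)

def map_query_to_concepts(query, top_k=2):
    """Return the top_k learned concepts best matching a textual query."""
    q_vec = vectorizer.transform([query])
    sims = cosine_similarity(q_vec, concept_matrix)[0]
    ranked = sorted(zip(concepts, sims), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]

print(map_query_to_concepts("football match at the stadium"))
# Expected: 'soccer' ranked first, the other concepts with near-zero scores.
```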
2.6.3.2 Automated Synchronization of Video Clips
The abundance of video material found in user-generated video collections could enable broad coverage of captured events. However, the lack of detailed semantic and time-based metadata associated with video content makes the task of identifying and synchronizing a set of video clips relative to a given event a nontrivial one.