extract relevant descriptions. The role of visual content in machine-driven labeling has long been investigated and has resulted in a variety of content-based image and video retrieval systems. Such systems commonly depend on low-level visual and spatiotemporal features and are based on the query-by-example paradigm. As a consequence, they are not effective if proper examples are unavailable. Furthermore, similarities in terms of low-level features do not easily translate into the high-level, semantic similarity expected by users.
Concept-based video retrieval [50] tries to bridge this semantic gap and has evolved over the last decade into a promising research field. It enables textual queries to be carried out on multimedia databases by substituting manual indexing with automatic detectors that mine media collections for semantic (visual) concepts. This approach has proven effective and, when a large set of concept detectors is available, its performance can be comparable with that of standard Web search [51]. Concept detection relies on machine learning techniques and, to be effective, requires vast training sets for building large-scale concept dictionaries and semantic relations. So far, the standard approach has been to employ manually labeled training examples provided by experts for concept learning. This solution is costly and gives rise to additional inconveniences: the number of learned concepts is limited, the insufficient scale of training data causes overfitting, and adapting to changes (such as new concepts of interest) remains difficult.
In [52], the huge video repository offered by YouTube is utilized as a novel kind of knowledge base for the machine interpretation of multimedia data. Web videos are exploited for two distinct purposes. On the one hand, the video clips returned by a YouTube search for a given concept are employed as positive examples to train the corresponding detector, while negative examples are drawn from other videos not tagged with that concept. Frames are sampled from the videos and their visual descriptors are fed to several statistical classifiers (Support Vector Machines, Passive-Aggressive Online Learning, Maximum Entropy), whose performance is compared. On the other hand, tag co-occurrences in video annotations are used to link concepts: for each concept, a bag-of-words representation is extracted from the tags of the associated video clips. The process is then repeated for user queries, which are thus mapped to the best matching learned concepts.
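To make the first of these two uses concrete, the following Python sketch shows how a concept detector could be trained from weakly labeled web video frames; the classifier choices mirror those compared in [52], but the descriptor dimensionality, the synthetic data, and the function name are illustrative assumptions rather than the actual implementation.

```python
# Minimal sketch of weakly supervised concept detector training, assuming
# frame descriptors have already been extracted from web videos.  The data
# below is synthetic; in practice the positives would come from videos
# returned by a tag/keyword search for the concept, the negatives from
# videos not tagged with it.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import PassiveAggressiveClassifier

def train_concept_detector(pos_descriptors, neg_descriptors, online=False):
    """Train a binary detector for one concept from weakly labeled frames."""
    X = np.vstack([pos_descriptors, neg_descriptors])
    y = np.concatenate([np.ones(len(pos_descriptors)),
                        np.zeros(len(neg_descriptors))])
    # Two of the statistical classifiers compared in [52]: a Support Vector
    # Machine and an online passive-aggressive learner.
    clf = PassiveAggressiveClassifier(max_iter=1000) if online else LinearSVC()
    clf.fit(X, y)
    return clf

# Usage with synthetic 128-dimensional descriptors standing in for real
# visual features (e.g., color/texture histograms or visual words).
rng = np.random.default_rng(0)
detector = train_concept_detector(rng.normal(1.0, 1.0, (200, 128)),
                                  rng.normal(0.0, 1.0, (300, 128)))
scores = detector.decision_function(rng.normal(0.5, 1.0, (10, 128)))
```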
The approach has been evaluated on a large dataset (1,200 hours of video) with 233 manually selected concepts. Precision in detecting concepts increases rapidly with the number of video clips included in the training set and stabilizes when 100-150 videos are used. Results show that the average precision achieved, although largely dependent on the concept, is promising (32.2%), suggesting that Web-based video collections indeed have the potential to support unsupervised visual and semantic learning.
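The tag-based mapping from user queries to learned concepts described above can likewise be sketched as follows; the concept vocabulary and tag strings are invented for illustration, and the TF-IDF weighting with cosine similarity is one plausible realization of the bag-of-words matching, not necessarily the exact scheme used in [52].

```python
# Minimal sketch of linking free-text queries to learned concepts: each
# concept is represented by a bag of words built from the tags of its
# associated video clips, and a query is mapped to the concepts whose tag
# profiles it matches best.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical concept-to-tags table; in practice the tags would be
# aggregated from the annotations of the videos retrieved per concept.
concept_tags = {
    "beach":   "sea sand surf waves sun holiday ocean",
    "soccer":  "football goal match stadium league players",
    "cooking": "recipe kitchen chef food baking dinner",
}

concepts = list(concept_tags)
vectorizer = TfidfVectorizer()
concept_matrix = vectorizer.fit_transform(concept_tags[c] for c in concepts)

def map_query_to_concepts(query, top_k=2):
    """Return the top_k learned concepts best matching a textual query."""
    q_vec = vectorizer.transform([query])
    sims = cosine_similarity(q_vec, concept_matrix)[0]
    ranked = sorted(zip(concepts, sims), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]

print(map_query_to_concepts("football match at the stadium"))
# Expected: 'soccer' ranked first, the other concepts with near-zero scores.
```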
2.6.3.2 Automated Synchronization of Video Clips
The abundance of video material found in user-generated video collections could enable broad coverage of captured events. However, the lack of detailed semantic and time-based metadata associated with video content makes the task of identifying and synchronizing a set of video clips relative to a given event a nontrivial one.