Digital Signal Processing Reference
Fig. 4.1 An example of the faceted representation of scene semantics
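To make the faceted representation in Fig. 4.1 concrete, the scene discussed in the text could be sketched as a simple data structure. This is only an illustrative sketch; the field names and query helper below are hypothetical, not part of any system described in the text.

```python
# A minimal, hypothetical sketch of a faceted scene description,
# populated with the example scene discussed in the text.
scene_semantics = {
    "which": ["outdoor", "city"],                       # semantic types or categories
    "what":  ["crowd", "flag", "building", "B. Obama"], # objects or scenes
    "where": {"B. Obama": "center-of-picture"},         # spatial relationships
    "how":   {"activities": ["waving"],                 # actions, activities, events
              "events": ["assembly"]},
}

def matches_category(semantics, category):
    """Illustrative helper: check whether a scene falls under a given
    'which'-facet category, as in semantic-based retrieval."""
    return category in semantics.get("which", [])
```

A faceted structure like this makes category-based queries trivial, e.g. `matches_category(scene_semantics, "outdoor")` returns `True` for this scene.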
much semantic information from this scene alone: four persons (a man, a woman,
and two children) are waving in an assembly. With contextual knowledge, one
can further recognize that this scene shows B. Obama with his wife and two daughters
announcing his presidential campaign in Springfield, Illinois, USA. However,
the semantic descriptions that can be automatically inferred by a learning system
are very limited. For example, the scene might be categorized as "outdoor" or
"city" by image classification algorithms, or annotated with "crowd", "flag", and
"building" by automatic annotation models; the person in the scene might be recognized as
"B. Obama" by face recognition algorithms; object localization algorithms
can also learn the spatial relationships of objects (e.g., B. Obama is at the center of the
picture); furthermore, high-level concept detection algorithms can be used to detect
activities (e.g., "waving") or events (e.g., "assembly"). As shown in Fig. 4.1,
these semantics can be summarized along four aspects: which (semantic types or
categories), what (objects or scenes), where (spatial relationships), and how (actions,
activities, or events):
1. Which - Semantic Types and Categories: The which facet typically refers to
the semantic types or categories of scenes. Given a taxonomy, this facet helps
answer the question: which type or category does the scene belong to? Describing scene
semantics using the which facet is very general, but proves to be of great
importance both for organizing unseen images/scenes into broad categories and for
semantic-based retrieval from large-scale collections.
2. What - Objects and Scenes: The what facet describes the objects and scenes in an
image/video. It answers the question: what is the subject (object/scene, etc.) in it?
Extracting the what facet from scenes covers a wide range of visual learning