For scene analysis, the goal is to understand what a scene contains. Automatic
image annotation is the process of automatically producing words to describe the
content of a given image; that is, to describe a previously unseen image with a subset of words from a given vocabulary [38]. Reviews of various aspects of image annotation are presented in
many recent papers [7, 18, 29, 67, 68, 77]. There is no simple mapping from raw
images or videos to dictionary terms. One approach builds a dictionary using vec-
tor quantization over a large set of visual descriptors extracted from a training set,
and uses a nearest-neighbor algorithm to count the number of occurrences of each
dictionary word in documents to be encoded. More robust approaches have been
proposed that represent each visual descriptor as a sparse weighted combination of
dictionary words. While these methods favor a sparse representation at the level of individual visual descriptors, they do not ensure that whole images have sparse representations. Moreover, visual similarity does not guarantee semantic similarity: images of bears in snow and airplanes in the sky, for example, are visually similar yet semantically unrelated.
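To make the dictionary-based pipeline concrete, the following is a minimal illustrative sketch of the vector-quantization approach described above: a k-means dictionary built over training descriptors, followed by a nearest-neighbor histogram of dictionary-word occurrences per image. The descriptor dimensionality, dictionary size, and function names are assumptions chosen for illustration, not taken from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_word(X, C):
    """Index of the nearest dictionary word (row of C) for each row of X."""
    d2 = (X ** 2).sum(1)[:, None] - 2.0 * X @ C.T + (C ** 2).sum(1)[None, :]
    return d2.argmin(axis=1)

def build_dictionary(descriptors, n_words=64, n_iters=20):
    """Plain k-means vector quantization over a pool of local descriptors."""
    centers = descriptors[rng.choice(len(descriptors), n_words, replace=False)]
    for _ in range(n_iters):
        assign = nearest_word(descriptors, centers)
        for k in range(n_words):
            members = descriptors[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers

def encode_image(image_descriptors, centers):
    """Normalized histogram of dictionary-word occurrences in one image."""
    words = nearest_word(image_descriptors, centers)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Toy usage with random 128-d "SIFT-like" descriptors.
train = rng.normal(size=(5_000, 128))
dictionary = build_dictionary(train)
image_code = encode_image(rng.normal(size=(300, 128)), dictionary)
```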
Wang et al. [68] cluster correlated keywords into topics in the local neighborhood
in order to reduce the number of labels, and develop an iterative method to max-
imize the margins between classes in both the visual and the semantic spaces.
Bengio et al. [7] use mixed-norm regularization to achieve sparsity at the image
level as well as a small overall dictionary.
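Mixed-norm regularization of this kind is typically built on the l1/l2 (group-lasso) penalty, whose proximal operator is group soft-thresholding: a whole group of coefficients is zeroed when its l2 norm falls below the threshold, which is what yields sparsity at the image level when groups correspond to dictionary words. The sketch below shows this generic building block, not the exact algorithm of Bengio et al. [7]; the group layout and threshold are illustrative assumptions.

```python
import numpy as np

def prox_group_l2(w, groups, lam):
    """Proximal operator of lam * sum_g ||w_g||_2: shrink each group of
    coefficients toward zero, dropping whole groups whose norm is small."""
    out = np.zeros_like(w)
    for g in groups:                     # g is an index array for one group
        norm = np.linalg.norm(w[g])
        if norm > lam:
            out[g] = (1.0 - lam / norm) * w[g]
    return out

# Toy usage: 8 dictionary words x 4 descriptors = 32 coefficients,
# with one group per dictionary word across all descriptors.
rng = np.random.default_rng(1)
w = rng.normal(size=32)
groups = [np.arange(i, 32, 8) for i in range(8)]
sparse_w = prox_group_l2(w, groups, lam=1.5)   # entire words may vanish
```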
Many methods have been suggested in recent years to solve the multi-instance
and multi-label scene annotation problem. For example, Zhou et al. [80] present
two methods for multi-instance and multi-label classification. Feng and Xu [18]
review previous work and present a transductive method (TMIML) for automatic multi-label, multi-instance image annotation that builds on graph-based semi-supervised learning.
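TMIML itself is more involved, but the graph-based semi-supervised idea it builds on can be illustrated with plain label propagation over an affinity graph: labeled and unlabeled images share edges weighted by feature similarity, and label scores diffuse along those edges. The following is a generic sketch, not the authors' method; the Gaussian affinity, parameters, and names are assumptions.

```python
import numpy as np

def propagate_labels(X, Y, sigma=1.0, alpha=0.9, n_iters=50):
    """X: (n, d) image features; Y: (n, L) multi-label matrix whose rows
    are all zero for unlabeled images. Returns per-image label scores."""
    # Gaussian affinity graph over all labeled and unlabeled images.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    F = Y.astype(float)
    for _ in range(n_iters):
        # Spread label mass along graph edges while anchoring labeled rows.
        F = alpha * S @ F + (1 - alpha) * Y
    return F

# Toy usage: 6 images, 3 labels, only the first two images are labeled.
rng = np.random.default_rng(2)
X = rng.normal(size=(6, 5))
Y = np.zeros((6, 3))
Y[0, 0] = 1.0
Y[1, [1, 2]] = 1.0               # multi-label: image 1 carries two labels
scores = propagate_labels(X, Y)  # threshold scores for multi-label output
```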
In most multi-label classification methods, all the labels come from the same
category type (a general pool), and a detector is built to differentiate between
classes. Zhu [79] applies semantic scene concept learning to autonomous agents that interact with human users and to robotic systems that must navigate real environments and recognize and retrieve objects in response to natural language commands. In such cases, the agents must learn to handle mutually non-exclusive scene labels such as red, ball, and cat, without being told their category types in advance. Zhu adds an intermediate level of concept categories (labels), such as color and shape, built from joint probability density functions of the visual features. Objects such as a Pepsi can are then associated with these concept categories using a two-level Bayesian inference network. However, each object is associated with only a single label from each concept category, for example one color and one shape.
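A rough sketch of this two-level idea, with per-label density models grouped under concept categories and a single winning label per category, might look as follows. The Gaussian models, feature layout, and labels are illustrative assumptions, not Zhu's actual network.

```python
import numpy as np

class ConceptCategory:
    """One concept category (e.g. color or shape) holding per-label
    axis-aligned Gaussian density models over visual features."""

    def __init__(self, label_models):
        self.label_models = label_models   # label -> (mean, var)

    def log_density(self, x, mean, var):
        return -0.5 * np.sum((x - mean) ** 2 / var + np.log(2 * np.pi * var))

    def classify(self, x):
        """Return the single most probable label in this category."""
        scores = {lbl: self.log_density(x, m, v)
                  for lbl, (m, v) in self.label_models.items()}
        return max(scores, key=scores.get)

# Toy usage: a 3-d color feature and a 2-d shape feature per object.
color = ConceptCategory({
    "red":  (np.array([0.9, 0.1, 0.1]), np.full(3, 0.05)),
    "blue": (np.array([0.1, 0.1, 0.9]), np.full(3, 0.05)),
})
shape = ConceptCategory({
    "ball": (np.array([1.0, 0.0]), np.full(2, 0.1)),
    "can":  (np.array([0.0, 1.0]), np.full(2, 0.1)),
})
obj = {"color": color.classify(np.array([0.8, 0.2, 0.1])),
       "shape": shape.classify(np.array([0.1, 0.9]))}
# One label per concept category, e.g. {"color": "red", "shape": "can"}.
```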
Video recordings pose additional challenges due to the conceptual spatial-temporal relations between entities in consecutive frames [35, 47, 73]. El Kaliouby and Robinson [17] present a relatively simple scheme that continuously monitors six real-time HMMs, each of which recognizes a particular affective state from the recent string of head gestures and facial expressions. Continuous monitoring of summed combinations of binary classifiers has also been used to track subtle changes and nuances in the vocal non-verbal expression of affective states during sustained interactions [59].
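A generic sketch of such parallel monitoring: score the recent window of discrete gesture/expression symbols under each of several HMMs with the scaled forward algorithm, and report the state whose model is most likely. The state names and all model parameters below are illustrative placeholders, not those of [17].

```python
import numpy as np

def log_forward(obs, pi, A, B):
    """Scaled forward algorithm: log-likelihood of a discrete observation
    sequence under one HMM. pi: (S,) initial state probabilities,
    A: (S, S) transition matrix, B: (S, V) emission matrix."""
    alpha = pi * B[:, obs[0]]
    log_like = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()                  # rescale to avoid underflow
        log_like += np.log(c)
        alpha /= c
    return log_like

# One small random HMM per affective state (all parameters are made up).
rng = np.random.default_rng(3)
states = ["agreeing", "concentrating", "disagreeing",
          "interested", "thinking", "unsure"]
models = {s: (np.full(3, 1 / 3),             # uniform initial distribution
              rng.dirichlet(np.ones(3), 3),  # 3 hidden states
              rng.dirichlet(np.ones(5), 3))  # 5 observable gesture symbols
          for s in states}

window = [0, 1, 1, 4, 2, 2]                  # recent gesture/expression codes
scores = {s: log_forward(window, *m) for s, m in models.items()}
current_state = max(scores, key=scores.get)  # most likely affective state
```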