10.4 Audio-Visual Fusion
When retrieving clips from a database, the user searches not only for clips
that are visually similar to the query, but also for clips that match it
audiovisually, which calls for a semantic concept-based query interface. The
term 'semantic concept' denotes a characteristic of the desired clip that is
derived from its distinct properties and expressed through audiovisual signals.
Semantics may be interpreted as logical story units, events, and activities,
such as an airplane flying, a car crashing, rioting, and so on. Retrieval using
concepts has been performed in many application domains. For example, Sudhir et
al. [318] and Miyamori et al. [317] addressed the semantics of tennis games with
concepts such as baseline rallies, passing shots, net games, and serve-and-volley.
These were derived by rule-based mechanisms as well as low-level features such
as player position, dominant colors, and mass center [316]. In more recent work,
Lay et al. [314, 315] presented elemental concept indexing (ECI), which defines
concepts as compounds of annotated words that can be decomposed into more
elementary units, and applies grammar rules to support query operations.
This section presents indexing and retrieval methods that derive semantics
from perceptual features and a machine learning based fusion model. The
semantic concepts are associated with perceptual features, rather than annotated
terms, so as to capture a finer level of perception. This interface is well
suited to the film editing/making process, where the perceptual characteristics
of a scene are very difficult to express in words. In the composition of a
scene, the notion of mise-en-scène, where the design of the props and the
setting revolves around the scene, is employed to enhance its potency [313].
Mise-en-scène means "put in the scene" and covers almost everything in the
composition itself: framing, movement of the camera and characters, lighting,
set design and the general visual environment, and even sound as it helps
elaborate the composition [312]. Such scene units are difficult to characterize
with textual descriptions, yet they lend themselves readily to feature
extraction at the signal level. An audiovisual fusion model employing perceptual
features provides a highly efficient interface for retrieving movie clips for
concepts such as Love Scene, Music Video, Fighting, Ship Crashing, and Dance
Party. These concepts are described here with textual labels to communicate
with readers; however, the definitions of the semantic concepts are based on
perceptual features in the video, not on text descriptors.
An SVM model is adopted to fuse audiovisual features and thereby characterize
semantic concepts in terms of perceptual features. Although the SVM is a
well-established machine learning technique [310], its application to the fusion
of multimodal features has only recently been studied. SVM-based decision
fusion has been employed for cartridge identification [311], as well as for
personal identity verification [306]. However, SVMs have not previously been
applied in decision fusion for the detection of semantic concepts in video.
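To make the two-stage idea concrete, the sketch below trains one SVM per
modality and fuses their decision scores with a second SVM, using scikit-learn.
The feature dimensions, kernels, random stand-in data, and all variable names
are illustrative assumptions for exposition, not the exact pipeline or features
used in this chapter.

```python
# A minimal sketch of SVM-based decision fusion for semantic concept
# detection in video clips. All data here is synthetic and illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in data: per-clip audio and visual feature vectors with binary
# labels (1 = clip exhibits the concept, e.g. "Fighting"; 0 = it does not).
n_clips = 200
X_audio = rng.normal(size=(n_clips, 20))   # e.g. spectral/tempo features
X_visual = rng.normal(size=(n_clips, 40))  # e.g. color/motion features
y = rng.integers(0, 2, size=n_clips)

# Stage 1: one SVM per modality, each producing a decision score.
svm_audio = SVC(kernel="rbf").fit(X_audio, y)
svm_visual = SVC(kernel="rbf").fit(X_visual, y)

scores = np.column_stack([
    svm_audio.decision_function(X_audio),
    svm_visual.decision_function(X_visual),
])

# Stage 2: a fusion SVM combines the per-modality scores into a single
# concept-detection decision.
svm_fusion = SVC(kernel="linear").fit(scores, y)

# Detecting the concept in a new clip follows the same two-stage path.
new_audio = rng.normal(size=(1, 20))
new_visual = rng.normal(size=(1, 40))
new_scores = np.column_stack([
    svm_audio.decision_function(new_audio),
    svm_visual.decision_function(new_visual),
])
print("concept present:", bool(svm_fusion.predict(new_scores)[0]))
```

Fusing decision scores rather than concatenating raw features keeps each
modality's classifier independent, which is the sense in which the fusion
here operates at the decision level.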