10.4 Audio-Visual Fusion
When retrieving clips from a database, the user searches not only for clips
that are visually similar to the query, but also for clips that match it
audiovisually, which calls for a semantic concept-based query interface. The
term 'semantic concept' denotes a characteristic of the desired clip that is
derived from its distinct properties and expressed through audiovisual signals.
Semantics may be interpreted as logical story units, events, and activities,
such as an airplane flying, a car crashing, rioting, and so on. Retrieval using
concepts has been performed in many application domains. For example, Sudhir et
al. [318] and Miyamori et al. [317] addressed the semantics of tennis games with
concepts such as baseline rallies, passing shots, net games, and serve-and-volley.
These were derived by rule-based mechanisms as well as low-level features such
as player position, dominant colors, and mass center [316]. In more recent work,
Lay et al. [314, 315] presented elemental concept indexing (ECI), which defines
concepts as compounds of annotated words that can be decomposed into more
elementary units, and applies grammar rules to support query operations.
This section presents indexing and retrieval methods that derive semantics
from perceptual features and a machine learning based fusion model. The
semantic concepts are associated with perceptual features, rather than annotated
terms, so as to capture a finer level of perception. This interface is well
suited to the film editing/making process, where the perceptual characteristics
of a scene are very difficult to express in words. In the composition of a
scene, the notion of mise-en-scène, where the design of the props and the
setting revolves around the scene, is employed to enhance its potency [313].
Mise-en-scène means "put in the scene" and covers almost everything in the
composition itself: framing, movement of the camera and characters, lighting,
set design and the general visual environment, and even sound as it helps
elaborate the composition [312]. Such scene units are difficult to characterize
with textual descriptions, yet they lend themselves readily to feature
extraction at the signal level. An audiovisual fusion model employing perceptual
features provides a highly efficient interface for retrieving movie clips for
concepts such as Love Scene, Music Video, Fighting, Ship Crashing, and Dance
Party. These concepts are described here with textual labels to communicate
with readers; however, the definitions of the semantic concepts are based on
perceptual features in the video, not on text descriptors.
An SVM model is adopted to fuse audiovisual features and thereby characterize
semantic concepts in terms of perceptual features. Although the SVM is a
well-established machine learning technique [310], its application to the fusion
of multimodal features has only recently been studied. SVM-based decision
fusion has been employed for cartridge identification [311], as well as for
personal identity verification [306]. However, SVMs have not previously been
applied in decision fusion for the detection of semantic concepts in video.
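To make the two-stage idea concrete, the sketch below trains one SVM per
modality and fuses their decision scores with a second SVM, using scikit-learn.
The feature dimensions, kernels, random stand-in data, and all variable names
are illustrative assumptions for exposition, not the exact pipeline or features
used in this chapter.

```python
# A minimal sketch of SVM-based decision fusion for semantic concept
# detection in video clips. All data here is synthetic and illustrative.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in data: per-clip audio and visual feature vectors with binary
# labels (1 = clip exhibits the concept, e.g. "Fighting"; 0 = it does not).
n_clips = 200
X_audio = rng.normal(size=(n_clips, 20))   # e.g. spectral/tempo features
X_visual = rng.normal(size=(n_clips, 40))  # e.g. color/motion features
y = rng.integers(0, 2, size=n_clips)

# Stage 1: one SVM per modality, each producing a decision score.
svm_audio = SVC(kernel="rbf").fit(X_audio, y)
svm_visual = SVC(kernel="rbf").fit(X_visual, y)

scores = np.column_stack([
    svm_audio.decision_function(X_audio),
    svm_visual.decision_function(X_visual),
])

# Stage 2: a fusion SVM combines the per-modality scores into a single
# concept-detection decision.
svm_fusion = SVC(kernel="linear").fit(scores, y)

# Detecting the concept in a new clip follows the same two-stage path.
new_audio = rng.normal(size=(1, 20))
new_visual = rng.normal(size=(1, 40))
new_scores = np.column_stack([
    svm_audio.decision_function(new_audio),
    svm_visual.decision_function(new_visual),
])
print("concept present:", bool(svm_fusion.predict(new_scores)[0]))
```

Fusing decision scores rather than concatenating raw features keeps each
modality's classifier independent, which is the sense in which the fusion
here operates at the decision level.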