Detecting Violent Content in Hollywood Movies and User-Generated Videos - Smart Information Systems: Computational Intelligence for Real-Life Applications - page 294

[Figure 11.1 diagram. Labels: audio signals (40 ms); video segment (of 0.6 s length); compute dense HoG & HoF; MFCC features (13-dimensional); HoG & HoF features (144-dimensional); sparse dictionaries; sparse coding of audio & visual content; SC-based audio representation (1,024-dimensional); SC-based HoG & HoF representations (2,048-dimensional).]

Fig. 11.1 The generation process of SC-based audio and visual representations for video segments. Each video segment is of length 0.6 s. Separate dictionaries are constructed and used for MFCC, HoG and HoF to generate 1,024-dimensional representations. Each HoG and HoF descriptor is 144-dimensional
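The pipeline in Fig. 11.1 can be illustrated with a toy sketch of sparse coding followed by pooling. This is not the authors' implementation: the encoder (greedy matching pursuit), the pooling step (maximum absolute coefficient per atom), and the names `matching_pursuit`, `segment_representation`, and `n_nonzero` are assumptions made for illustration. The sketch only shows why the length of a segment-level representation equals the number of dictionary atoms.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(x, dictionary, n_nonzero=2):
    """Greedy sparse code of x over unit-norm dictionary atoms (assumed encoder)."""
    residual = list(x)
    code = [0.0] * len(dictionary)
    for _ in range(n_nonzero):
        correlations = [dot(residual, atom) for atom in dictionary]
        # Pick the atom that best explains the current residual.
        k = max(range(len(dictionary)), key=lambda i: abs(correlations[i]))
        if correlations[k] == 0.0:
            break  # residual is orthogonal to every atom, nothing left to explain
        code[k] += correlations[k]
        residual = [r - correlations[k] * a for r, a in zip(residual, dictionary[k])]
    return code


def segment_representation(descriptors, dictionary, n_nonzero=2):
    """Pool the sparse codes of all descriptors in a segment (assumed max pooling)."""
    codes = [matching_pursuit(d, dictionary, n_nonzero) for d in descriptors]
    return [max(abs(c[k]) for c in codes) for k in range(len(dictionary))]


# Toy example: 2 atoms in 2 dimensions, 2 descriptors in one segment.
toy_dictionary = [[1.0, 0.0], [0.0, 1.0]]
toy_descriptors = [[3.0, 0.0], [0.0, -2.0]]
representation = segment_representation(toy_descriptors, toy_dictionary)
```

With a dictionary of 1,024 atoms, the same procedure would yield a 1,024-dimensional vector per feature type, and concatenating the HoG and HoF vectors would give the 2,048 dimensions shown in the figure.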

11.3.1.2 Low-Level Visual Representation

Filmmakers often use motion to elicit a particular perception in the audience [ 33 ]. We therefore use motion-related descriptors for the visual representation of video segments. One of these is ViF, an efficient motion descriptor. We compute a ViF descriptor for each video segment to represent statistics of flow-vector magnitude changes over time. For a detailed explanation of how this descriptor is computed, the reader is referred to [ 18 ].
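The idea of summarizing flow-vector magnitude changes over time can be illustrated with a small sketch. This is not the ViF computation of [ 18 ]; it is a simplified illustration that assumes per-pixel flow magnitudes are already available, flags a pixel when its magnitude change exceeds the mean change for that frame pair, and histograms how often each pixel was flagged. The function name, the thresholding rule, and `n_bins` are assumptions made for illustration.

```python
def magnitude_change_statistics(magnitudes, n_bins=4):
    """Histogram of how often each pixel shows a significant magnitude change.

    magnitudes: list of T frames, each a flat list of per-pixel flow magnitudes.
    Returns a normalized histogram with n_bins entries.
    """
    if len(magnitudes) < 2:
        raise ValueError("need at least two frames of flow magnitudes")
    n_pixels = len(magnitudes[0])
    significant_counts = [0] * n_pixels

    for previous, current in zip(magnitudes, magnitudes[1:]):
        changes = [abs(c - p) for p, c in zip(previous, current)]
        # Assumed adaptive threshold: the mean change of this frame pair.
        threshold = sum(changes) / n_pixels
        for i, change in enumerate(changes):
            if change > threshold:
                significant_counts[i] += 1

    # Fraction of frame pairs in which each pixel changed significantly.
    n_pairs = len(magnitudes) - 1
    mean_map = [count / n_pairs for count in significant_counts]

    # Fixed-range histogram over [0, 1], normalized by the number of pixels.
    histogram = [0] * n_bins
    for value in mean_map:
        histogram[min(int(value * n_bins), n_bins - 1)] += 1
    return [h / n_pixels for h in histogram]


# Toy example: 3 frames of 4 pixels; only the first pixel ever moves.
toy_magnitudes = [[0, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
vif_like = magnitude_change_statistics(toy_magnitudes)
```

In the toy input, the first pixel changes significantly in one of two frame pairs, so it falls into the bin covering 0.5, while the three static pixels fall into the first bin.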

In addition to motion, the static content of video frames is also important for evoking a particular perception in the audience [ 33 ]. We therefore also use static content representations, specifically affect-related static visual descriptors. Inspired by the work presented in [ 25 ], we compute the mean and standard deviation of saturation, brightness, and hue in the HSL color space. We also compute the colorfulness of the keyframe of each video segment using the method in [ 17 ], where the keyframe is taken to be the frame in the middle of the segment.
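The color statistics described above can be sketched with the Python standard library. The sketch makes several assumptions: frames are nested lists of 8-bit RGB tuples, the statistics are taken over the keyframe, the lightness channel returned by `colorsys.rgb_to_hls` stands in for "brightness", hue is averaged linearly rather than circularly, and the population standard deviation is used. The colorfulness measure of [ 17 ] is not reproduced here.

```python
import colorsys
from statistics import mean, pstdev


def middle_keyframe(frames):
    """Return the frame in the middle of a segment, used as its keyframe."""
    return frames[len(frames) // 2]


def hsl_statistics(frame):
    """Mean and standard deviation of hue, brightness, and saturation.

    frame: list of rows, each a list of (r, g, b) tuples with values in 0..255.
    Returns a dict mapping channel name to a (mean, std) pair, all in [0, 1].
    """
    hues, lightnesses, saturations = [], [], []
    for row in frame:
        for r, g, b in row:
            # colorsys returns (hue, lightness, saturation), each in [0, 1].
            h, l, s = colorsys.rgb_to_hls(r / 255.0, g / 255.0, b / 255.0)
            hues.append(h)
            lightnesses.append(l)
            saturations.append(s)
    return {
        "hue": (mean(hues), pstdev(hues)),
        "brightness": (mean(lightnesses), pstdev(lightnesses)),
        "saturation": (mean(saturations), pstdev(saturations)),
    }


# Toy example: a segment of three 2x2 frames; the middle one is half red, half black.
red, black = (255, 0, 0), (0, 0, 0)
toy_segment = [
    [[black, black], [black, black]],
    [[red, black], [red, black]],
    [[red, red], [red, red]],
]
stats = hsl_statistics(middle_keyframe(toy_segment))
```

The three (mean, std) pairs give six values per segment, which could be concatenated into a single static descriptor alongside the colorfulness value.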

11.3.1.3 Mid-Level Visual Representation

Mid-level visual representations are based on HoG and HoF features extracted from the visual content of video segments of 0.6 s length. HoG and HoF descriptors are densely sampled and computed for subvolumes of video segments (HoG descriptors
