[Fig. 11.2 schematic: video segments from the training dataset yield low-level features, which are fed to sparse dictionary learning (m = 400, k = 1024) to produce an m × k audio dictionary (MFCC) and visual dictionaries (HoG & HoF).]
Fig. 11.2 The generation of audio and visual dictionaries with sparse coding. Each video segment is of length 0.6 s. Low-level features are MFCCs and densely sampled HoG and HoF descriptors.
HoG descriptors are subsampled every 6 frames and HoF descriptors every 2 frames, as recommended in [32]. The Horn-Schunck method [20] is applied to compute the optical flow vectors from which the HoF descriptors are extracted. The resulting HoG and HoF descriptors are subsequently used to generate mid-level HoG and HoF representations separately, as illustrated in Fig. 11.1. The construction of the SC-based HoG and HoF dictionaries is illustrated in Fig. 11.2.
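The flow computation can be sketched as follows. This is a minimal NumPy/SciPy illustration of the Horn-Schunck method [20], not the authors' implementation; the frame names, the regularization weight alpha, the iteration count, and the derivative estimates are illustrative choices.

```python
# Minimal Horn-Schunck optical flow sketch between two grayscale frames;
# illustrative parameter values only.
import numpy as np
from scipy.ndimage import convolve

# Averaging kernel for the neighbourhood means of u and v.
AVG_KERNEL = np.array([[1/12, 1/6, 1/12],
                       [1/6,  0.0, 1/6],
                       [1/12, 1/6, 1/12]])


def horn_schunck(frame1, frame2, alpha=15.0, n_iter=100):
    """Return dense flow fields (u, v) from frame1 to frame2."""
    f1 = frame1.astype(np.float64)
    f2 = frame2.astype(np.float64)

    # Simple spatio-temporal derivatives, averaged over both frames.
    Iy, Ix = np.gradient((f1 + f2) / 2.0)
    It = f2 - f1

    u = np.zeros_like(f1)
    v = np.zeros_like(f1)
    for _ in range(n_iter):
        u_avg = convolve(u, AVG_KERNEL)
        v_avg = convolve(v, AVG_KERNEL)
        # Jacobi-style update derived from the Horn-Schunck equations.
        num = Ix * u_avg + Iy * v_avg + It
        den = alpha ** 2 + Ix ** 2 + Iy ** 2
        u = u_avg - Ix * num / den
        v = v_avg - Iy * num / den
    return u, v
```

HoF descriptors are then built by histogramming the orientations of these flow vectors over local spatio-temporal cells.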
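The dictionary construction of Fig. 11.2 can likewise be approximated with an off-the-shelf sparse coder. The sketch below uses scikit-learn's MiniBatchDictionaryLearning as a stand-in for the sparse dictionary learning step; the descriptor arrays are hypothetical placeholders, and max-pooling the codes is only one plausible way to form the mid-level representation described here.

```python
# Sketch of SC-based dictionary learning and mid-level encoding (cf. Fig. 11.2);
# the solver, pooling and data are stand-ins, not the chapter's exact pipeline.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning


def learn_dictionary(descriptors, n_atoms=1024, sparsity=1.0, seed=0):
    """Fit a k-atom dictionary from (n_samples, m) low-level descriptors.

    The fitted model's `components_` attribute holds the dictionary as a
    (k, m) matrix, i.e. the transpose of the m x k matrix drawn in Fig. 11.2.
    """
    model = MiniBatchDictionaryLearning(n_components=n_atoms,
                                        alpha=sparsity,
                                        random_state=seed)
    return model.fit(descriptors)


def mid_level_representation(model, segment_descriptors):
    """Sparse-code one segment's descriptors against the learned dictionary
    and max-pool the absolute coefficients into a single k-dimensional vector."""
    codes = model.transform(segment_descriptors)   # (n_descriptors, k)
    return np.abs(codes).max(axis=0)               # (k,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.standard_normal((400, 39))              # stand-in low-level features
    audio_dict = learn_dictionary(train, n_atoms=64)    # k = 1024 in the chapter
    segment = rng.standard_normal((20, 39))
    print(mid_level_representation(audio_dict, segment).shape)   # (64,)
```

In this sketch each video segment ends up summarized by a single k-dimensional sparse-code vector; the exact pooling used in the chapter is not specified here.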
11.3.2 Violence Detection Model
“Violence” is a concept that can be expressed in diverse manners. For instance, both explosion and scream scenes are labeled as violent according to the definition that we adopted. However, these scenes may differ greatly from each other in terms of audio-visual appearance, depending on the characteristics of the violence they contain. Therefore, instead of learning a single model for violence detection, learning multiple models constitutes a more judicious choice. This is why we first perform feature space
 