[Fig. 11.2 schematic: video segments from the training dataset yield low-level features, which are fed to sparse dictionary learning (m = 400, k = 1024) to produce an m × k audio dictionary (MFCC) and visual dictionaries (HoG & HoF).]
Fig. 11.2 The generation of audio and visual dictionaries with sparse coding. Each video segment is of length 0.6 s. Low-level features are MFCCs and densely sampled HoG and HoF descriptors.
HoG descriptors are subsampled every 6 frames and HoF descriptors every 2 frames, as recommended in [32]. The Horn-Schunck method [20] is applied to compute the optical flow vectors from which the HoF descriptors are extracted. The resulting HoG and HoF descriptors are subsequently used to generate mid-level HoG and HoF representations separately, as illustrated in Fig. 11.1. The construction of the SC-based HoG and HoF dictionaries is illustrated in Fig. 11.2.
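The flow computation can be sketched as follows. This is a minimal NumPy/SciPy illustration of the Horn-Schunck method [20], not the authors' implementation; the frame names, the regularization weight alpha, the iteration count, and the derivative estimates are illustrative choices.

```python
# Minimal Horn-Schunck optical flow sketch between two grayscale frames;
# illustrative parameter values only.
import numpy as np
from scipy.ndimage import convolve

# Averaging kernel for the neighbourhood means of u and v.
AVG_KERNEL = np.array([[1/12, 1/6, 1/12],
                       [1/6,  0.0, 1/6],
                       [1/12, 1/6, 1/12]])


def horn_schunck(frame1, frame2, alpha=15.0, n_iter=100):
    """Return dense flow fields (u, v) from frame1 to frame2."""
    f1 = frame1.astype(np.float64)
    f2 = frame2.astype(np.float64)

    # Simple spatio-temporal derivatives, averaged over both frames.
    Iy, Ix = np.gradient((f1 + f2) / 2.0)
    It = f2 - f1

    u = np.zeros_like(f1)
    v = np.zeros_like(f1)
    for _ in range(n_iter):
        u_avg = convolve(u, AVG_KERNEL)
        v_avg = convolve(v, AVG_KERNEL)
        # Jacobi-style update derived from the Horn-Schunck equations.
        num = Ix * u_avg + Iy * v_avg + It
        den = alpha ** 2 + Ix ** 2 + Iy ** 2
        u = u_avg - Ix * num / den
        v = v_avg - Iy * num / den
    return u, v
```

HoF descriptors are then built by histogramming the orientations of these flow vectors over local spatio-temporal cells.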
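The dictionary construction of Fig. 11.2 can likewise be approximated with an off-the-shelf sparse coder. The sketch below uses scikit-learn's MiniBatchDictionaryLearning as a stand-in for the sparse dictionary learning step; the descriptor arrays are hypothetical placeholders, and max-pooling the codes is only one plausible way to form the mid-level representation described here.

```python
# Sketch of SC-based dictionary learning and mid-level encoding (cf. Fig. 11.2);
# the solver, pooling and data are stand-ins, not the chapter's exact pipeline.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning


def learn_dictionary(descriptors, n_atoms=1024, sparsity=1.0, seed=0):
    """Fit a k-atom dictionary from (n_samples, m) low-level descriptors.

    The fitted model's `components_` attribute holds the dictionary as a
    (k, m) matrix, i.e. the transpose of the m x k matrix drawn in Fig. 11.2.
    """
    model = MiniBatchDictionaryLearning(n_components=n_atoms,
                                        alpha=sparsity,
                                        random_state=seed)
    return model.fit(descriptors)


def mid_level_representation(model, segment_descriptors):
    """Sparse-code one segment's descriptors against the learned dictionary
    and max-pool the absolute coefficients into a single k-dimensional vector."""
    codes = model.transform(segment_descriptors)   # (n_descriptors, k)
    return np.abs(codes).max(axis=0)               # (k,)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.standard_normal((400, 39))              # stand-in low-level features
    audio_dict = learn_dictionary(train, n_atoms=64)    # k = 1024 in the chapter
    segment = rng.standard_normal((20, 39))
    print(mid_level_representation(audio_dict, segment).shape)   # (64,)
```

In this sketch each video segment ends up summarized by a single k-dimensional sparse-code vector; the exact pooling used in the chapter is not specified here.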
11.3.2 Violence Detection Model
“Violence” is a concept that can be expressed in diverse manners. For instance, both explosion and scream scenes are labeled as violent according to the definition that we adopted. However, these scenes may differ greatly from each other in terms of audio-visual appearance, depending on the characteristics of the violence they contain. Therefore, instead of learning a single model for violence detection, learning multiple models constitutes a more judicious choice. This is why we first perform feature space
 