that the method outperforms the approaches which use no external data (e.g., Internet
resources) in the MediaEval 2013 VSD task.
Derbas and Quénot [10] explore the joint dependence of audio and visual features
for violent scene detection. They first combine the audio and visual features and
then statistically determine joint multimodal patterns. The proposed method mainly
relies on an audio-visual BoW representation. The experiments are performed in the
context of the MediaEval 2013 VSD task. The obtained results show the potential of
the proposed approach in comparison to methods which use audio and visual features
separately, and to other fusion methods such as early and late fusion.
11.3 The Violence Detection Method
In this section, we discuss (1) the representation of video segments, and (2) the
learning of a violence model, which are the two main components of our method.
11.3.1 Video Representation
Sound effects and background music in movies are essential for stimulating people's
perception [33]. Therefore, audio signals are important for the representation of
videos. The visual content of videos provides complementary information for the
detection of violence. We represent the audio content using mid-level
representations, whereas the visual content is represented at two different levels:
low-level and mid-level.
11.3.1.1 Mid-Level Audio Representation
Mid-level audio representations are based on MFCC features extracted from the
audio signals of video segments of 0.6 s length, as illustrated in Fig. 11.1. In order to
generate the mid-level representations for video segments, we apply an abstraction
process which uses an MFCC-based Bag-of-Audio-Words (BoAW) approach with
sparse coding (SC) as the coding scheme.
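A minimal sketch of this feature-extraction step is given below, assuming the audio track has already been demultiplexed to a WAV file. The filename, sampling rate, and number of MFCC coefficients are illustrative assumptions and not values prescribed by the chapter:

import librosa
import numpy as np

# Hypothetical audio track extracted from a movie; 22,050 Hz is librosa's default rate.
y, sr = librosa.load("movie_audio.wav", sr=22050)

seg_len = int(0.6 * sr)  # 0.6-second video segments, as described above

# Cut the signal into non-overlapping 0.6 s segments and compute MFCCs per segment.
segments = [y[i:i + seg_len] for i in range(0, len(y) - seg_len + 1, seg_len)]
mfccs_per_segment = [
    librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13).T  # shape: (frames, 13), 13 coefficients assumed
    for seg in segments
]

Each segment is thus reduced to a set of MFCC feature vectors, which the BoAW abstraction described above turns into a single mid-level vector.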
The construction of the SC-based audio dictionary is illustrated in Fig. 11.2. We
employ the dictionary learning technique presented in [26]. The advantage of this
technique is its scalability to very large datasets containing millions of training
samples, which makes it well suited for our work. In order to learn the dictionary of
size k (k = 1,024 in this work) for sparse coding, 400 × k MFCC feature vectors are
sampled from the training data (this figure was determined experimentally). In the
coding phase, we construct the sparse representations of audio signals by using the
LARS algorithm [12]. Given an audio signal and a dictionary, the LARS algorithm
returns sparse representations for its MFCC feature vectors. In order to generate the
final sparse representation of a video segment, which is represented as a set of MFCC
feature vectors, we apply the max-pooling technique over the sparse codes.
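The sketch below illustrates the dictionary-learning, coding, and pooling steps with off-the-shelf components: scikit-learn's MiniBatchDictionaryLearning stands in for the online method of [26], and sparse_encode with the LARS solver is used for the coding step. The dictionary size, sparsity level, and toy data sizes are assumptions chosen so the example runs quickly; the chapter itself uses k = 1,024 and 400 × k training vectors:

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

rng = np.random.default_rng(0)

# Toy stand-ins: the chapter samples 400 * k MFCC vectors with k = 1,024.
k = 64                                              # dictionary size (toy value)
training_mfccs = rng.standard_normal((400 * k, 13)) # placeholder 13-dim MFCC vectors

# Learn the SC dictionary with an online/mini-batch method (stand-in for [26]).
dico = MiniBatchDictionaryLearning(n_components=k, alpha=1.0,
                                   batch_size=256, random_state=0)
dico.fit(training_mfccs)
D = dico.components_                                # shape: (k, 13)

def segment_representation(segment_mfccs):
    """Sparse-code a segment's MFCC vectors with LARS, then max-pool into one k-dim vector."""
    codes = sparse_encode(segment_mfccs, D, algorithm="lars", n_nonzero_coefs=10)
    return codes.max(axis=0)                        # max pooling over the segment's frames

# One 0.6 s segment with, e.g., 60 MFCC frames yields a single k-dimensional BoAW vector.
boaw_vector = segment_representation(rng.standard_normal((60, 13)))

Max pooling keeps, for each dictionary atom, the strongest activation observed within the segment, so the resulting k-dimensional vector summarizes the whole segment regardless of how many MFCC frames it contains.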