that the method outperforms the approaches which use no external data (e.g., Internet
resources) in the MediaEval 2013 VSD task.
Derbas and Quénot [10] explore the joint dependence of audio and visual features
for violent scene detection. They first combine the audio and visual features and
then statistically determine joint multimodal patterns. The proposed method mainly
relies on an audio-visual BoW representation. The experiments are performed in the
context of the MediaEval 2013 VSD task. The obtained results show the potential of
the proposed approach in comparison to methods which use audio and visual features
separately, and to other fusion methods such as early and late fusion.
11.3 The Violence Detection Method
In this section, we discuss (1) the representation of video segments, and (2) the
learning of a violence model, which are the two main components of our method.
11.3.1 Video Representation
Sound effects and background music in movies are essential for stimulating people's
perception [33]. Therefore, audio signals are important for the representation of
videos. The visual content of videos provides complementary information for the
detection of violence. We represent the audio content using mid-level
representations, whereas the visual content is represented at two different levels:
low-level and mid-level.
11.3.1.1 Mid-Level Audio Representation
Mid-level audio representations are based on MFCC features extracted from the
audio signals of video segments of 0.6 s length, as illustrated in Fig. 11.1. In order to
generate the mid-level representations for video segments, we apply an abstraction
process which uses an MFCC-based Bag-of-Audio-Words (BoAW) approach with
sparse coding (SC) as the coding scheme.
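A minimal sketch of this feature-extraction step is given below, assuming the audio track has already been demultiplexed to a WAV file. The filename, sampling rate, and number of MFCC coefficients are illustrative assumptions and not values prescribed by the chapter:

import librosa
import numpy as np

# Hypothetical audio track extracted from a movie; 22,050 Hz is librosa's default rate.
y, sr = librosa.load("movie_audio.wav", sr=22050)

seg_len = int(0.6 * sr)  # 0.6-second video segments, as described above

# Cut the signal into non-overlapping 0.6 s segments and compute MFCCs per segment.
segments = [y[i:i + seg_len] for i in range(0, len(y) - seg_len + 1, seg_len)]
mfccs_per_segment = [
    librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13).T  # shape: (frames, 13), 13 coefficients assumed
    for seg in segments
]

Each segment is thus reduced to a set of MFCC feature vectors, which the BoAW abstraction described above turns into a single mid-level vector.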
The construction of the SC-based audio dictionary is illustrated in Fig. 11.2. We
employ the dictionary learning technique presented in [26]. The advantage of this
technique is its scalability to very large datasets containing millions of training
samples, which makes it well suited for our work. In order to learn the dictionary of
size k (k = 1,024 in this work) for sparse coding, 400 × k MFCC feature vectors are
sampled from the training data (this figure was determined experimentally). In the
coding phase, we construct the sparse representations of audio signals by using the
LARS algorithm [12]. Given an audio signal and a dictionary, the LARS algorithm
returns sparse representations for its MFCC feature vectors. In order to generate the
final sparse representation of a video segment, which is represented as a set of MFCC
feature vectors, we apply the max-pooling technique over the sparse codes.
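The sketch below illustrates the dictionary-learning, coding, and pooling steps with off-the-shelf components: scikit-learn's MiniBatchDictionaryLearning stands in for the online method of [26], and sparse_encode with the LARS solver is used for the coding step. The dictionary size, sparsity level, and toy data sizes are assumptions chosen so the example runs quickly; the chapter itself uses k = 1,024 and 400 × k training vectors:

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

rng = np.random.default_rng(0)

# Toy stand-ins: the chapter samples 400 * k MFCC vectors with k = 1,024.
k = 64                                              # dictionary size (toy value)
training_mfccs = rng.standard_normal((400 * k, 13)) # placeholder 13-dim MFCC vectors

# Learn the SC dictionary with an online/mini-batch method (stand-in for [26]).
dico = MiniBatchDictionaryLearning(n_components=k, alpha=1.0,
                                   batch_size=256, random_state=0)
dico.fit(training_mfccs)
D = dico.components_                                # shape: (k, 13)

def segment_representation(segment_mfccs):
    """Sparse-code a segment's MFCC vectors with LARS, then max-pool into one k-dim vector."""
    codes = sparse_encode(segment_mfccs, D, algorithm="lars", n_nonzero_coefs=10)
    return codes.max(axis=0)                        # max pooling over the segment's frames

# One 0.6 s segment with, e.g., 60 MFCC frames yields a single k-dimensional BoAW vector.
boaw_vector = segment_representation(rng.standard_normal((60, 13)))

Max pooling keeps, for each dictionary atom, the strongest activation observed within the segment, so the resulting k-dimensional vector summarizes the whole segment regardless of how many MFCC frames it contains.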