audio features from both the time and the frequency domain, such as energy entropy, short-time energy, zero crossing rate (ZCR), spectral flux, and roll-off, are employed. A polynomial SVM is used as the classifier. The main issue with this work is that audio signals are assumed to have already been segmented into semantically meaningful nonoverlapping pieces (i.e., shots, explosions, fights, screams, music, and speech).
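As a rough illustration only (not the cited work's implementation), the following sketch computes two of the listed time-domain features, short-time energy and ZCR, on already-segmented audio clips and trains a polynomial-kernel SVM; the frame length, hop size, and toy clips and labels are hypothetical placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def frame_features(signal, frame_len=1024, hop=512):
    """Short-time energy and zero crossing rate, one pair per frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)                         # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # zero crossing rate
        feats.append([energy, zcr])
    return np.array(feats)

# Hypothetical pre-segmented clips with toy class labels
# (e.g., 0 = music, 1 = speech, 2 = scream).
rng = np.random.default_rng(0)
clips = [rng.standard_normal(16000) for _ in range(6)]
labels = [0, 1, 2, 0, 1, 2]

# One feature vector per clip: the mean of its per-frame features.
X = np.array([frame_features(c).mean(axis=0) for c in clips])
clf = SVC(kernel="poly", degree=3).fit(X, labels)
print(clf.predict(X))
```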
In their chapter [7], de Souza et al., similarly to other works related to violence detection, adopt their own definition of violence and designate violent scenes as those containing fights (i.e., aggressive human actions), regardless of the context and the number of people involved. Their approach is based on a Bag-of-Words (BoW) model, where local Spatio-Temporal Interest Point (STIP) features are used as the feature representation of video shots. They compare the performance of STIP-based BoW with SIFT-based BoW on their own dataset, which contains 400 videos (200 violent and 200 nonviolent). The STIP-based BoW solution proved superior to the SIFT-based one.
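The BoW pipeline underlying such approaches can be sketched generically as follows; this is not de Souza et al.'s code, and the descriptor dimensionality, vocabulary size, and toy videos and labels are placeholders standing in for real STIP or SIFT descriptors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Hypothetical local descriptors: each video yields a variable number
# of 64-dimensional descriptors (stand-ins for STIP or SIFT).
rng = np.random.default_rng(0)
videos = [rng.standard_normal((rng.integers(50, 100), 64)) for _ in range(10)]
labels = [0, 1] * 5  # toy labels: 0 = nonviolent, 1 = violent

# 1. Build a visual vocabulary by clustering all descriptors.
vocab_size = 20
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
kmeans.fit(np.vstack(videos))

# 2. Represent each video as a normalized histogram of visual-word counts.
def bow_histogram(descriptors):
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    return hist / hist.sum()

X = np.array([bow_histogram(v) for v in videos])

# 3. Train a classifier on the BoW histograms.
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X))
```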
Hassner et al. [18] present a method for real-time detection of breaking violence in crowded scenes. They define violence as sudden changes in motion in video footage. The method considers statistics of how the magnitudes of optical-flow vectors change over time. These statistics, collected for short frame sequences, are represented using the Violent Flows (ViF) descriptor. ViF descriptors are then classified as either violent or nonviolent using a linear SVM. The authors also introduce a new dataset of crowded scenes on which their method is evaluated. According to the presented results, the ViF descriptor outperforms the Local Trinary Patterns (LTP) [38], histogram of oriented gradients (HoG) [23], and histogram of oriented optical flow (HoF) [23] descriptors, as well as the combined histogram of oriented gradients and optical flow (HNF) descriptor [23]. The ViF descriptor is also evaluated on well-known datasets of videos of noncrowded scenes, such as the Hockey dataset [28] and the ASLAN dataset [22], in order to assess its performance in action-classification tasks on "non-textured" (i.e., noncrowded) videos. With small vocabularies, the ViF descriptor outperforms the LTP and STIP descriptors, while with larger vocabularies STIP outperforms ViF. However, this performance gain comes at a higher computational cost.
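A loose sketch in the spirit of ViF (not the exact descriptor from [18]) might look like the following: per-pixel flow magnitudes are computed with Farnebäck optical flow, their frame-to-frame changes are binarized with an adaptive threshold, averaged over time, pooled over a spatial grid, and classified with a linear SVM. The grid size, thresholding rule, and toy data are all assumptions made for illustration.

```python
import cv2
import numpy as np
from sklearn.svm import LinearSVC

def vif_like_descriptor(frames, grid=4):
    """ViF-style statistics of flow-magnitude changes over a short sequence.

    frames: list of grayscale uint8 frames of equal size.
    """
    mags = []
    for prev, nxt in zip(frames, frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=2))  # per-pixel flow magnitude

    # Binary maps: did the magnitude change significantly between frame pairs?
    changes = [np.abs(m2 - m1) for m1, m2 in zip(mags, mags[1:])]
    binary = [c >= c.mean() for c in changes]      # adaptive threshold (assumed)
    mean_map = np.mean(binary, axis=0)             # average change map over time

    # Pool the mean change map over a coarse spatial grid into one vector.
    h, w = mean_map.shape
    cells = [mean_map[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid].mean()
             for i in range(grid) for j in range(grid)]
    return np.array(cells)

# Hypothetical toy data: random "videos" of 10 frames each, with toy labels.
rng = np.random.default_rng(0)
videos = [[rng.integers(0, 256, (64, 64), dtype=np.uint8) for _ in range(10)]
          for _ in range(6)]
labels = [0, 1, 0, 1, 0, 1]

X = np.array([vif_like_descriptor(v) for v in videos])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```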
In [35], Xu et al. propose to use Motion SIFT (MoSIFT) descriptors to extract a low-level representation of a video. Feature selection is applied to the MoSIFT descriptors using kernel density estimation. The selected features are subsequently summarized into a mid-level feature representation based on a BoW model using sparse coding. The method is evaluated on two different types of datasets: crowded scenes [18] and noncrowded scenes [28]. Although Xu et al. do not explicitly define violence, they study fights or sudden changes in motion as violence-related concepts. The results show that the proposed method is promising and outperforms HoG-based and HoF-based BoW representations on both datasets.
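A minimal sketch of the sparse-coding BoW step (omitting Xu et al.'s KDE-based feature selection, and using random stand-ins for MoSIFT descriptors) could look like the following: a learned dictionary replaces the k-means vocabulary, and max pooling summarizes each video's sparse codes into a single mid-level vector. Dictionary size and sparsity parameters are assumptions.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder

# Hypothetical low-level descriptors (stand-ins for MoSIFT), one set per video.
rng = np.random.default_rng(0)
videos = [rng.standard_normal((80, 32)) for _ in range(6)]

# 1. Learn a sparse-coding dictionary from all descriptors
#    (the analogue of a BoW vocabulary).
dico = DictionaryLearning(n_components=16, alpha=1.0, max_iter=20,
                          random_state=0).fit(np.vstack(videos))
coder = SparseCoder(dictionary=dico.components_,
                    transform_algorithm="lasso_lars", transform_alpha=1.0)

# 2. Encode each video's descriptors and max-pool the codes into one
#    mid-level feature vector per video.
def midlevel(descriptors):
    codes = coder.transform(descriptors)   # one sparse code per descriptor
    return np.abs(codes).max(axis=0)       # max pooling over descriptors

X = np.array([midlevel(v) for v in videos])
print(X.shape)  # (6, 16): one mid-level vector per video, ready for an SVM
```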
Second, we review multimodal methods, which constitute the most common type of approach to violent content detection in videos and which fuse audio and visual cues at either the feature or the decision level. Aiming at detecting horror, Wang et al. [5] apply Multiple Instance Learning (MIL; MI-SVM [3]) using color, texture, and MFCC features. The authors do not explicitly state their definition
of horror. Therefore, assessing the performance of their method and identifying the