where average motion, camera motion, and average shot length are used for scene
representation and an SVM for classification, video scenes are classified into action
and non-action scenes. In the second phase, using the Viola-Jones face detector, faces are
detected in each keyframe of action scenes and the presence of blood pixels near
detected human faces is checked using color information. The approach is compared
with the method of Lin and Wang [ 24 ] because of the similar violence definitions,
and is shown to perform better in terms of precision and recall.
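The second phase of this approach lends itself to a short illustration. The following is a minimal sketch assuming OpenCV's bundled Haar cascade as the Viola-Jones detector; the HSV colour thresholds, the region margin, and the 2% pixel-ratio cut-off are illustrative assumptions, not the values used by the authors.

```python
import cv2

def has_blood_near_faces(keyframe_bgr, margin=0.5, blood_ratio_thresh=0.02):
    """Detect faces with a Viola-Jones (Haar cascade) detector and check whether
    the region around each face contains a notable share of blood-coloured
    (strongly red) pixels. Thresholds are illustrative assumptions."""
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(keyframe_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    hsv = cv2.cvtColor(keyframe_bgr, cv2.COLOR_BGR2HSV)
    # Red hues wrap around 0 in OpenCV's 0-179 hue range, so combine two bands.
    red_low = cv2.inRange(hsv, (0, 120, 70), (10, 255, 255))
    red_high = cv2.inRange(hsv, (170, 120, 70), (180, 255, 255))
    blood_mask = cv2.bitwise_or(red_low, red_high)

    h, w = gray.shape
    for (x, y, fw, fh) in faces:
        # Expand the face box by `margin` on each side to cover nearby pixels.
        x0, y0 = max(0, int(x - margin * fw)), max(0, int(y - margin * fh))
        x1, y1 = min(w, int(x + (1 + margin) * fw)), min(h, int(y + (1 + margin) * fh))
        region = blood_mask[y0:y1, x0:x1]
        if region.size and (region > 0).mean() > blood_ratio_thresh:
            return True
    return False
```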
Ding et al. [ 11 ] observe that most existing methods identify horror scenes only
from independent frames, ignoring the context cues among frames in a video scene.
In order to consider contextual cues in horror scene recognition, they propose a Multi-view MIL (M²IL) model based on a joint sparse coding technique which simultaneously takes into account the bag of instances from the independent view and from the
contextual view. Their definition of violence is very similar to the definition in [ 5 ].
They perform experiments on a horror video dataset collected from the Internet and
the results demonstrate that the performance of the proposed method is superior to
other existing well-known MIL algorithms.
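As background to the bag-of-instances formulation, the sketch below shows a naive multiple-instance baseline in which a scene (the bag) is scored by its most suspicious keyframe (an instance). It illustrates only the MIL setting itself, not the joint sparse coding M²IL model of Ding et al.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_naive_mil(bags, bag_labels):
    """Naive MIL baseline: propagate each bag's label to all of its instances
    and train an ordinary frame-level classifier. Each bag is an array of
    shape (n_instances, n_features)."""
    X = np.vstack(bags)
    y = np.concatenate([[lbl] * len(bag) for bag, lbl in zip(bags, bag_labels)])
    return LogisticRegression(max_iter=1000).fit(X, y)

def predict_bag(clf, bag):
    """Standard MIL assumption: a bag is positive if its highest-scoring
    instance is positive (max-pooling over instance probabilities)."""
    return clf.predict_proba(bag)[:, 1].max()
```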
The works discussed in the following paragraphs employ the same definitions of
violence (i.e., objective and/or subjective violence) adopted in the MediaEval 2013
VSD task. Penet et al. [ 29 ] propose to exploit temporal and multimodal informa-
tion for objective violence detection at video shot level. In order to model violence,
different kinds of Bayesian network structure learning algorithms are investigated.
The proposed method is tested on the dataset of the MediaEval 2011 VSD Task.
Experiments demonstrate that both multimodality and temporality add valuable
information into the system and improve the performance in terms of MediaEval
cost function [ 9 ]. The best-performing method achieves 50% false alarms and 3%
missed detection, ranking among the best submissions to the MediaEval 2011 VSD
task.
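The MediaEval cost function referred to above combines the two error types into a single figure. A hedged sketch follows; the weights c_fa and c_miss are placeholders, as the official weighting is defined in the task description [9].

```python
def detection_cost(false_alarm_rate, miss_rate, c_fa=1.0, c_miss=10.0):
    """Weighted detection cost: a linear combination of the false-alarm rate
    and the missed-detection rate. The weights used here are illustrative
    assumptions, not the official MediaEval values."""
    return c_fa * false_alarm_rate + c_miss * miss_rate

# Figures reported for the best run (50% false alarms, 3% missed detections):
print(detection_cost(0.50, 0.03))  # 0.80 under the assumed weights
```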
In [ 21 ], Ionescu et al. address the detection of objective violence in Hollywood
movies. The method relies on fusing mid-level concept predictions inferred from
low-level features. The mid-level concepts used in this work are gory scenes, pres-
ence of blood, firearms and cold weapons (for the visual modality); presence of
screams and gunshots (for the audio modality); and car chases, presence of explo-
sions, fights, and fire (for the audio-visual modalities). The authors employ a bank
of multilayer perceptrons featuring a dropout training scheme in order to construct
10 violence-related concept classifiers. The predictions of these concept classifiers
are then merged to construct the final violence classifier. The method is tested on the
dataset of the MediaEval 2012 VSD task and ranked first among 34 other submis-
sions, in terms of precision and F-measure.
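To make the architecture concrete, the sketch below outlines one possible shape of such a concept classifier bank in PyTorch; the hidden-layer sizes, dropout rate, and the linear fusion layer are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ConceptMLP(nn.Module):
    """One mid-level concept classifier (e.g. 'blood' or 'gunshots'): a small
    multilayer perceptron with dropout, fed with low-level features.
    Layer sizes and dropout rate are illustrative assumptions."""
    def __init__(self, in_dim, hidden=256, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # Probability that the concept is present in the shot.
        return torch.sigmoid(self.net(x)).squeeze(-1)

class ViolenceFusion(nn.Module):
    """Late fusion: the 10 concept scores of a shot are concatenated and fed
    to a final linear layer that predicts violent vs. non-violent."""
    def __init__(self, n_concepts=10):
        super().__init__()
        self.fuse = nn.Linear(n_concepts, 1)

    def forward(self, concept_scores):  # shape: (batch, n_concepts)
        return torch.sigmoid(self.fuse(concept_scores)).squeeze(-1)
```

Training each concept classifier separately and learning only a small fusion layer on top keeps the final violence classifier simple and lets each mid-level concept prediction be inspected on its own.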
In [ 16 ], Goto and Aoki propose a violence detection method which is based on
the combination of visual and audio features extracted at the segment level, using
machine learning techniques. Violence detection models are learned via multiple
kernel learning. The authors also propose mid-level violence clustering in order to
implicitly learn mid-level concepts without using manual annotations. The proposed
method is trained and evaluated on the MediaEval 2013 VSD task using the official
MediaEval metric Mean Average Precision at 100 (MAP@100). The results show