should not be 17 or under 17, while, for instance, FSK 16 means that the audience should be
at least 16. This can also be a source of confusion among consumers.
The degree of violence one is able or willing to bear may vary strongly even
within a group of people of the same age. That is probably why parents should receive
from Technicolor not only a rating but also a preview of the most violent scenes,
to help them decide whether the movie is suitable for their child.
Besides the issue of definition, another important step in detecting violent content
in movies is the representation of movie segments. Many of the existing
works on violence detection (e.g., [5, 14]) represent videos using low-level
features, especially for audio signals. Inferring abstract
representations is more suitable than directly using low-level features in order to
bridge the semantic gap between the features and high-level human perception of
violence. However, high-level semantics are more difficult to detect, and state-of-the-art
detectors are far from perfect. Therefore, mid-level representations
may help model video segments one step closer to human perception.
This chapter aims at assessing the discriminative power of mid-level audio-visual
features to model violence in Hollywood movies. We also investigate the effects
of combining mid-level audio-visual features with low-level audio-visual features
for the detection of violent content and show that promising results are obtained by
fusing audio-visual cues at the decision level.
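To make the idea of decision-level fusion concrete, the following is a minimal sketch and not the method of this chapter. It assumes that an audio classifier and a visual classifier have each already produced a violence score in [0, 1] for every video segment, and it combines them with a weighted average. The function names, the weight and the threshold are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def fuse_decisions(audio_scores: Sequence[float],
                   visual_scores: Sequence[float],
                   audio_weight: float = 0.5) -> List[float]:
    """Weighted average of per-segment scores from two modalities.

    Assumption: both sequences hold scores in [0, 1], one per segment,
    in the same segment order. The fusion rule itself is illustrative.
    """
    if len(audio_scores) != len(visual_scores):
        raise ValueError("both modalities must score the same segments")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return [audio_weight * a + visual_weight * v
            for a, v in zip(audio_scores, visual_scores)]


def label_segments(fused_scores: Sequence[float],
                   threshold: float = 0.5) -> List[bool]:
    """Mark a segment as violent when its fused score reaches the threshold."""
    return [score >= threshold for score in fused_scores]


if __name__ == "__main__":
    audio = [0.9, 0.2]    # hypothetical audio classifier outputs
    visual = [0.7, 0.4]   # hypothetical visual classifier outputs
    fused = fuse_decisions(audio, visual)
    print(label_segments(fused))
```

Because fusion happens on classifier outputs rather than on features, each modality can use its own representation and its own classifier; only the scores need to be comparable.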
The chapter is organized as follows. Section 11.2 explores the recent developments
and reviews methods which have been proposed in the literature in order to detect
violence in movies. In Sect. 11.3, we introduce our method and the functioning of its
various components. We provide and discuss evaluation results obtained on Hollywood
movies in Sect. 11.4. In Sect. 11.5, we present our browser-based visualization
tool which provides an intuitive way of using our solution. Concluding remarks and
future directions to expand our current approach are presented in Sect. 11.6.
11.2 Related Work
Although video content analysis has been studied extensively in the literature, vio-
lence analysis of movies or of user-generated videos is restricted to a few studies
only. We discuss here some of the most representative ones which use audio and/or
visual cues. A difficulty arises regarding the definition of violence. In some of the
works presented in this section, the authors do not explicitly state their definition
of violence. In addition, nearly all papers in which the concept is defined consider
a different definition of violence; therefore, whenever possible, we also specify the
definition adopted in each work discussed in this section.
First, we briefly discuss uni-modal (i.e., based exclusively on the audio or visual
modality) violence detection methods. Giannakopoulos et al. [13] define violent
scenes as those containing shots, explosions, fights and screams, whereas nonviolent
content corresponds to audio segments containing music and speech. Frame-level