situations in which it properly works is difficult. Video scenes are divided into video
shots, where each scene is formulated as a bag and each shot as an instance inside the
bag for MIL. Color and texture features are used for the visual representation of video
shots, while MFCCs are used for the audio representation. More specifically, the mean,
variance, and first-order differential of each MFCC dimension are employed as the
audio features. As observed from their results [5], using color and textural
information in addition to MFCC features slightly improves the performance.
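To make the bag/instance construction concrete, the following is a minimal sketch, assuming librosa for MFCC extraction and pre-computed shot boundaries; the names shot_instance and scene_bag are illustrative and not taken from [5]. Each shot yields one instance vector built from the mean, variance, and first-order differential of its MFCCs, and a scene becomes the bag of its shot instances (color and texture descriptors would be appended to the same vectors):

```python
import numpy as np
import librosa

def shot_instance(y, sr, n_mfcc=13):
    """One MIL instance per shot: MFCC mean, variance, and delta statistics."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    delta = librosa.feature.delta(mfcc)                      # first-order differential
    return np.concatenate([mfcc.mean(axis=1),
                           mfcc.var(axis=1),
                           delta.mean(axis=1)])

def scene_bag(audio_path, shot_bounds_sec):
    """One MIL bag per scene: a matrix of instance vectors, one row per shot."""
    y, sr = librosa.load(audio_path, sr=None)
    instances = [shot_instance(y[int(s * sr):int(e * sr)], sr)
                 for s, e in shot_bounds_sec]
    return np.vstack(instances)   # shape: (n_shots_in_scene, 3 * n_mfcc)
```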
Giannakopoulos et al. [14], in an attempt to extend their earlier approach based solely on
audio cues [13], propose a multimodal two-stage approach. In the first step, they
perform audio and visual analysis of segments of one-second duration. In the audio
analysis part, audio features such as energy entropy, ZCR, and MFCCs are extracted
and the mean and standard deviation of these features are used to classify scenes into
one of seven classes (violent ones including shots, fights and screams). In the visual
analysis part, average motion, motion variance, and average motion of individuals
appearing in a scene are used to classify segments as having either high or low activity.
The classifications obtained in this first step are then used to train a k-NN classifier.
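This two-stage idea can be illustrated with the following sketch, which is an assumption-laden toy example rather than the implementation of [14]: the seven-class audio posteriors and the binary high/low activity decision of each one-second segment are stacked into a mid-level vector, and a k-NN classifier is trained on these vectors (synthetic data stands in for the actual first-stage outputs):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def midlevel_vector(audio_class_probs, high_activity):
    """Mid-level feature: 7-dim audio class posteriors plus a binary activity flag."""
    return np.append(audio_class_probs, float(high_activity))

# Toy stand-in data: 100 one-second segments with simulated first-stage outputs.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(7), size=100)     # simulated 7-class audio posteriors
activity = rng.integers(0, 2, size=100)         # simulated high/low activity decisions
X = np.vstack([midlevel_vector(p, a) for p, a in zip(probs, activity)])
y = rng.integers(0, 2, size=100)                # violent / non-violent ground truth

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))                       # second-stage decision per segment
```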
In [15], a three-stage method is proposed. In the first stage, the authors apply a
semi-supervised cross-feature learning algorithm [37] to the extracted audio-visual
features such as motion activity, ZCR, MFCCs, pitch, and rhythm features for the
selection of candidate violent video shots. In the second stage, high-level audio
events (e.g., screaming, gun shots, explosions) are detected via SVM training for
each audio event. In the third stage, the outputs of the classifiers generated in the
previous two stages are linearly weighted for final decision. Although not explicitly
stated, the authors define violent scenes as those which contain action and violence-
related concepts such as gunshots, explosions, and screams. The method was only
evaluated on action movies. However, violent content can be present in movies of
all genres (e.g., drama). The performance of this method in genres other than action
is, therefore, unclear.
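The linear fusion of the third stage can be sketched as follows; this is an illustrative example, not the implementation of [15], with toy features, one probabilistic SVM per high-level audio event, and arbitrarily chosen weights:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_audio = rng.normal(size=(200, 20))                 # toy per-shot audio features
events = ["scream", "gunshot", "explosion"]
event_svms = {ev: SVC(probability=True).fit(X_audio, rng.integers(0, 2, size=200))
              for ev in events}                      # one SVM per high-level audio event

def fuse(shot_features, candidate_score, weights):
    """Linearly weight the stage-one candidate score and the per-event SVM scores."""
    event_scores = [event_svms[ev].predict_proba(shot_features.reshape(1, -1))[0, 1]
                    for ev in events]
    scores = np.concatenate([[candidate_score], event_scores])
    return float(np.dot(weights, scores))            # higher value => more likely violent

weights = np.array([0.4, 0.2, 0.2, 0.2])             # assumed weights, not from the paper
print(fuse(X_audio[0], candidate_score=0.7, weights=weights))
```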
Lin and Wang [24] train separate classifiers for audio and visual analysis and
combine these classifiers by co-training. Probabilistic latent semantic analysis is
applied in the audio classification part. Spectrum power, brightness, bandwidth, pitch,
MFCCs, spectrum flux, ZCR, and harmonicity prominence features are extracted.
An audio vocabulary is subsequently constructed by k-means clustering. Audio clips
of one-second length are represented by the audio vocabulary. This method also
constructs mid-level audio representations with a technique derived from text analysis.
However, this approach has the drawback of constructing a dictionary of only
20 audio words, which prevents a precise representation of the audio signal
of video shots. In the visual classification part, the degree of violence of a video shot
is determined by using motion intensity and the presence or absence of flames, explosions,
and blood in the video shot. Violence-related concepts studied in this work
are fights, murders, gunshots, and explosions. This method was also evaluated only
on action movies. Therefore, the performance of this solution in genres other than
action is uncertain.
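The audio-vocabulary construction can be sketched with synthetic data as follows; the 20-cluster k-means codebook mirrors the small dictionary criticized above, and each one-second clip is represented by its normalized histogram of audio-word assignments:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
frame_features = rng.normal(size=(5000, 24))        # toy frame-level audio features
vocab = KMeans(n_clusters=20, n_init=10, random_state=0).fit(frame_features)

def clip_histogram(clip_frames, vocab):
    """Bag-of-audio-words histogram for the frames of one clip."""
    words = vocab.predict(clip_frames)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()                        # normalized word histogram

clip = rng.normal(size=(100, 24))                   # frames of one one-second clip
print(clip_histogram(clip, vocab))
```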
Chen et al. [6] propose a two-phase solution. According to their violence
definition, a violent scene is a scene that contains action and blood. In the first phase,