The VLFeat 8 open source library is used to perform k-means clustering (k = 10 in this work).
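The clustering step can be sketched as follows. This is a minimal illustration in Python using scikit-learn's k-means in place of VLFeat (the chapter uses VLFeat itself); the descriptor matrix is synthetic and its dimensionality is an assumption, not taken from the chapter.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for VLFeat's k-means

rng = np.random.default_rng(0)
# Hypothetical stand-in for extracted low-level descriptors (e.g., MFCC frames).
features = rng.normal(size=(500, 13))

# k = 10 clusters, as in the chapter.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(features)
codebook = kmeans.cluster_centers_      # learned codebook, shape (10, 13)
assignments = kmeans.predict(features)  # nearest-center index per descriptor
```

The resulting codebook is what the mid-level representation is built on: each descriptor is mapped to its nearest center before pooling over a video segment.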
We trained the two-class SVMs with an RBF kernel using libsvm 9 as the SVM
implementation. Training was performed on audio and visual features extracted at
the video segment level. SVM parameters were optimized by fivefold cross-validation
on the training data.
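The training setup described above can be sketched as follows. This is an illustrative Python version using scikit-learn's `SVC` (which wraps libsvm); the feature matrix, labels, and parameter grid are placeholders, not values from the chapter.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical segment-level features and binary violent/non-violent labels.
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

# RBF-kernel SVM; C and gamma tuned by fivefold cross-validation,
# mirroring the chapter's setup (the grid values here are illustrative).
grid = GridSearchCV(
    SVC(kernel="rbf", probability=True),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
best = grid.best_estimator_
scores = best.predict_proba(X)[:, 1]  # violence confidence per segment
```

The per-segment confidence scores are what a ranked list of shots (and, later, decision-level fusion) would be built from.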
11.4.3 Evaluation Metrics
We used two different evaluation metrics in our evaluation: (1) average precision at
100 (AP@100), the official metric of the MediaEval 2013 VSD task,
and (2) average precision (AP), the official metric of the MediaEval
2014 VSD task. Although AP@100 is no longer the official metric of the
MediaEval VSD task, providing a ranked list of violent video shots
to the user remains important for our use case. Additionally, including the AP@100
metric allows comparison with other works that report their
results solely in terms of AP@100.
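A common formulation of AP over the top k results of a ranked list can be sketched as below. This is an assumption-laden illustration: definitions of AP@k vary in detail (e.g., the normalization term), and the exact MediaEval scoring script may differ.

```python
def average_precision_at_k(ranked_labels, k=100):
    """AP@k over a ranked list of binary relevance labels
    (1 = violent shot, 0 = non-violent), normalized by the
    number of relevant items retrieved in the top k."""
    ranked_labels = ranked_labels[:k]
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / i  # precision at each relevant rank
    return precision_sum / hits if hits else 0.0

# Toy ranked list: relevant shots at ranks 1, 3, and 4.
ap = average_precision_at_k([1, 0, 1, 1, 0])  # (1/1 + 2/3 + 3/4) / 3
```

Evaluating at a cutoff of 100 rewards systems that place violent shots near the top of the returned list, which matches the use case of showing a user a short ranked list.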
11.4.4 Results and Discussion
Table 11.2 reports the mean AP and AP@100 metrics on the Hollywood movie
dataset. We observe that the mid-level audio representation based on MFCC and
sparse coding provides promising performance in terms of average precision and
outperforms all the other representations used in this work. We also note that
performance is further improved by linearly fusing these mid-level audio cues with
low- and mid-level visual cues at the decision level.
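Decision-level linear fusion amounts to a weighted sum of the classifiers' output scores, as in the minimal sketch below. The scores and the weight are illustrative placeholders; the chapter does not specify the fusion weight here.

```python
import numpy as np

# Hypothetical per-segment confidence scores from separately trained models.
audio_scores = np.array([0.9, 0.2, 0.6, 0.1])
visual_scores = np.array([0.7, 0.4, 0.5, 0.3])

# Decision-level linear fusion: weighted sum of the classifiers' scores.
# The weight below is an assumption for illustration only.
w = 0.6
fused = w * audio_scores + (1 - w) * visual_scores

# Rank segments by fused score to produce the final violent-shot ranking.
ranking = np.argsort(-fused)
```

Because fusion happens on scores rather than features, each modality's classifier can be trained and tuned independently before being combined.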
Table 11.3 reports the mean AP and AP@100 metrics on the Web video dataset.
We observe results similar to those obtained on the Hollywood movie
dataset (Table 11.2). We used the same violence detection models, trained
on the 17 Hollywood movies, and evaluated them on short Web videos.
The results in terms of AP and AP@100 are still encouraging and even surpass
those obtained on the Hollywood movie dataset. We can therefore conclude
that our violence detection method generalizes particularly well to types of
video content not used for training the models. Another interesting observation
is that affect-related color features seem to provide better results in terms of
the AP metrics on the Web video dataset than on the Hollywood movie dataset.
Hollywood movie dataset. One final remark is that the linear fusion of audio-visual
8 http://www.vlfeat.org/.
9 http://www.csie.ntu.edu.tw/~cjlin/libsvm/.