The VLFeat 8 open source library is used to perform k-means clustering (k = 10 in this work).
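The clustering step can be sketched as follows. This is a minimal illustration in Python using scikit-learn's k-means in place of VLFeat (the chapter uses VLFeat itself); the descriptor matrix is synthetic and its dimensionality is an assumption, not taken from the chapter.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for VLFeat's k-means

rng = np.random.default_rng(0)
# Hypothetical stand-in for extracted low-level descriptors (e.g., MFCC frames).
features = rng.normal(size=(500, 13))

# k = 10 clusters, as in the chapter.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(features)
codebook = kmeans.cluster_centers_      # learned codebook, shape (10, 13)
assignments = kmeans.predict(features)  # nearest-center index per descriptor
```

The resulting codebook is what the mid-level representation is built on: each descriptor is mapped to its nearest center before pooling over a video segment.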
We trained the two-class SVMs with an RBF kernel using libsvm 9 as the SVM
implementation. Training was performed on audio and visual features extracted at
the video segment level. SVM parameters were optimized by fivefold cross-validation
on the training data.
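The training setup described above can be sketched as follows. This is an illustrative Python version using scikit-learn's `SVC` (which wraps libsvm); the feature matrix, labels, and parameter grid are placeholders, not values from the chapter.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical segment-level features and binary violent/non-violent labels.
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

# RBF-kernel SVM; C and gamma tuned by fivefold cross-validation,
# mirroring the chapter's setup (the grid values here are illustrative).
grid = GridSearchCV(
    SVC(kernel="rbf", probability=True),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
best = grid.best_estimator_
scores = best.predict_proba(X)[:, 1]  # violence confidence per segment
```

The per-segment confidence scores are what a ranked list of shots (and, later, decision-level fusion) would be built from.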
11.4.3 Evaluation Metrics
We used two different evaluation metrics in our evaluation: (1) average precision at
100 (AP@100), the official metric of the MediaEval 2013 VSD task,
and (2) average precision (AP), the official metric of the MediaEval
2014 VSD task. Although AP@100 is no longer the official metric of the
MediaEval VSD task, providing a ranked list of violent video shots
to the user remains important for our use case. Additionally, including the AP@100
metric allows comparison with other works that report their
results solely in terms of AP@100.
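A common formulation of AP over the top k results of a ranked list can be sketched as below. This is an assumption-laden illustration: definitions of AP@k vary in detail (e.g., the normalization term), and the exact MediaEval scoring script may differ.

```python
def average_precision_at_k(ranked_labels, k=100):
    """AP@k over a ranked list of binary relevance labels
    (1 = violent shot, 0 = non-violent), normalized by the
    number of relevant items retrieved in the top k."""
    ranked_labels = ranked_labels[:k]
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            precision_sum += hits / i  # precision at each relevant rank
    return precision_sum / hits if hits else 0.0

# Toy ranked list: relevant shots at ranks 1, 3, and 4.
ap = average_precision_at_k([1, 0, 1, 1, 0])  # (1/1 + 2/3 + 3/4) / 3
```

Evaluating at a cutoff of 100 rewards systems that place violent shots near the top of the returned list, which matches the use case of showing a user a short ranked list.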
11.4.4 Results and Discussion
Table 11.2 reports the mean AP and AP@100 metrics on the Hollywood movie
dataset. We observe that the mid-level audio representation based on MFCC and
sparse coding provides promising performance in terms of average precision and
outperforms all the other representations used in this work. We also note that
performance is further improved by linearly fusing these mid-level audio cues with
low- and mid-level visual cues at the decision level.
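Decision-level linear fusion amounts to a weighted sum of the classifiers' output scores, as in the minimal sketch below. The scores and the weight are illustrative placeholders; the chapter does not specify the fusion weight here.

```python
import numpy as np

# Hypothetical per-segment confidence scores from separately trained models.
audio_scores = np.array([0.9, 0.2, 0.6, 0.1])
visual_scores = np.array([0.7, 0.4, 0.5, 0.3])

# Decision-level linear fusion: weighted sum of the classifiers' scores.
# The weight below is an assumption for illustration only.
w = 0.6
fused = w * audio_scores + (1 - w) * visual_scores

# Rank segments by fused score to produce the final violent-shot ranking.
ranking = np.argsort(-fused)
```

Because fusion happens on scores rather than features, each modality's classifier can be trained and tuned independently before being combined.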
Table 11.3 reports the mean AP and AP@100 metrics on the Web video dataset.
We observe results similar to those obtained on the Hollywood movie
dataset (Table 11.2). We used the same violence detection models, trained
on the 17 Hollywood movies, and evaluated them on short Web videos.
The results in terms of AP and AP@100 are still encouraging and even surpass
those obtained on the Hollywood movie dataset. We can therefore conclude
that our violence detection method generalizes particularly well to types of
video content not used for training the models. Another interesting observation
is that affect-related color features seem to provide better results in terms of
the AP metrics on the Web video dataset than on the Hollywood movie dataset.
Hollywood movie dataset. One final remark is that the linear fusion of audio-visual
8 http://www.vlfeat.org/.
9 http://www.csie.ntu.edu.tw/~cjlin/libsvm/.