Our mid-level representations are based on BoW: we first extract audio features (MFCC) and visual features (dense HoG and HoF), and subsequently apply sparse coding to each feature descriptor separately. We used ViF and affect-related color features as the low-level visual representations of videos.
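To make the mid-level pipeline concrete, the following is a minimal sketch of sparse coding applied to MFCC descriptors followed by max-pooling into one vector per video segment. It assumes descriptors are already extracted; the dictionary size, sparsity penalty, and pooling choice are illustrative assumptions rather than the exact configuration used in this work.

```python
# Sketch: mid-level audio representation via sparse coding of MFCC descriptors.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def learn_dictionary(train_descriptors, n_atoms=256):
    """train_descriptors: (n_frames_total, n_mfcc) MFCC vectors pooled over training data."""
    dico = MiniBatchDictionaryLearning(n_components=n_atoms,
                                       transform_algorithm='lasso_lars',
                                       transform_alpha=1.0,
                                       random_state=0)
    dico.fit(train_descriptors)
    return dico

def encode_segment(dico, segment_descriptors):
    """Sparse-code each MFCC frame of a segment, then max-pool into a fixed-length vector."""
    codes = dico.transform(segment_descriptors)   # (n_frames, n_atoms) sparse codes
    return np.abs(codes).max(axis=0)              # mid-level, BoW-style segment representation
```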
Since “violence” is a very diverse concept, instead of learning a single violence detection model, we first performed feature space partitioning by clustering video segments and then learned a separate model for each violence subconcept. To combine the outputs of these violence models in the test phase, we performed classifier selection: a video segment is labeled with the output of the classifier whose cluster center is closest to the segment in terms of Euclidean distance.
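A minimal sketch of this partition-and-select scheme is given below: training segments are clustered into subconcepts, one classifier is trained per cluster, and at test time the classifier of the nearest cluster center is selected. The number of subconcepts and the SVM settings are illustrative assumptions, not the tuned values used in our experiments.

```python
# Sketch: feature space partitioning with per-subconcept classifiers and
# test-time classifier selection by nearest cluster center.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def train_subconcept_models(X, y, n_subconcepts=4):
    """X: (n_segments, n_features) segment representations; y: violence labels."""
    km = KMeans(n_clusters=n_subconcepts, random_state=0).fit(X)
    models = []
    for c in range(n_subconcepts):
        mask = km.labels_ == c
        # one violence model per subconcept (assumes each cluster contains both classes)
        models.append(SVC(kernel='rbf', probability=True).fit(X[mask], y[mask]))
    return km, models

def predict_segment(km, models, x):
    """Classifier selection: use the model whose cluster center is closest to x."""
    c = int(np.argmin(np.linalg.norm(km.cluster_centers_ - x, axis=1)))
    return models[c].predict_proba(x.reshape(1, -1))[0, 1]
```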
To demonstrate the wide applicability of our solution, we evaluated our method
on two different datasets: one dataset of Hollywood movies and one dataset of user-
generated videos.
We showed that the mid-level audio representation based on MFCC and sparse coding provides very promising performance in terms of the AP and AP@100 metrics and outperforms the visual representations used in this work. We also fused these mid-level audio cues with low- and mid-level visual cues at the decision level using linear fusion, which further improved the results and outperformed the unimodal video representations in terms of AP.
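Decision-level linear fusion amounts to a weighted combination of the per-segment scores of the two modalities, as in the sketch below. The fusion weight is an illustrative assumption; in practice it would be selected on a validation set.

```python
# Sketch: decision-level linear fusion of audio and visual classifier scores.
import numpy as np

def linear_fusion(audio_scores, visual_scores, w=0.6):
    """Fuse two equal-length score arrays; w weights the audio modality."""
    return w * np.asarray(audio_scores) + (1.0 - w) * np.asarray(visual_scores)
```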
Unlike Hollywood movies, user-generated videos are more challenging, since they are not professionally edited, e.g., to enhance dramatic scenes. We therefore also demonstrated the performance of our system on a challenging web video dataset containing short web videos from YouTube. The evaluation results on these short web videos were similar to those on the Hollywood movie dataset, showing that our violence detection method generalizes well to different types of video content.
We observed from the overall evaluation results that our method performs better when violent content is better expressed by audio features. Hence, as future work, we plan to extend our visual representation set with more discriminative representations. Another direction for future work is to further investigate the feature space partitioning concept and to optimize the number or distribution of subconcepts in order to enhance the classification performance of our method.
Acknowledgments The research leading to these results has received funding from the European
Community FP7 under grant agreement number 261743 (NoE VideoSense). We would like to thank
Technicolor (http://www.technicolor.com/) for providing the ground truth, video shot boundaries,
and the corresponding keyframes which have been used in this work. Our thanks also go to Fudan
University and Vietnam University of Science for providing the ground truth of the Web video
dataset.