Image Processing Reference
2 Related work
There are different approaches to the representation of temporal video segments for
content-based video information retrieval problems such as video action recognition and event
detection. These approaches observe the real world and bring us facts about how to detect
the visual features and in which context. Regarding the visual features, the mentioned approaches
can generally be grouped as key-frame-, bag-of-words- (BoW), interest-point-, and motion-based
approaches, reflecting the way of representation.
Key-frame-based representation approaches focus on detecting key frames in video segments
in order to use them in classification. This kind of representation is used in Refs. [5-8]
for video scene detection and video summarization. The study of Vasileios et al. [5] involves
segmenting videos into shots, from which key frames are extracted. To overcome the
difficulty of requiring prior knowledge of scene duration, the shots are assigned to visual
similarity groups. Each shot is then labeled according to its group, and a sequence alignment
algorithm is applied to recognize the patterns in which shot labels change.
Shot similarity is estimated using visual features, and shot order is preserved while applying
sequence alignment. In another study, content is represented with keywords among the words
representing concepts or objects needed in content-based image retrieval, and a key-frame-based
approach is used for videos. Images are represented as vectors of feature vectors containing
visual features such as color and edge, and they are modeled by a hidden Markov model whose
states represent concepts.
Another work addresses annotation and retrieval for videos using key frames, proposing a new
approach that automatically annotates video shots with semantic concepts; retrieval is then
carried out by textual queries. An efficient method extracting a semantic candidate set of
video shots is presented. In a further study, a method for key-frame extraction is proposed
and used for video summarization, with metrics proposed for measuring quality.
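As a concrete illustration of key-frame-based representation, consider a minimal sketch (hypothetical, not the method of any cited study) that selects a new key frame whenever the frame histogram drifts sufficiently far from the last selected key frame:

```python
from typing import List

def gray_histogram(frame: List[List[int]], bins: int = 8) -> List[float]:
    """Normalized grayscale histogram of a frame (2-D list of 0-255 values)."""
    counts = [0] * bins
    total = 0
    for row in frame:
        for px in row:
            counts[px * bins // 256] += 1
            total += 1
    return [c / total for c in counts]

def hist_distance(h1: List[float], h2: List[float]) -> float:
    """L1 distance between two normalized histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def extract_key_frames(frames: List[List[List[int]]],
                       threshold: float = 0.5) -> List[int]:
    """Keep the first frame, then every frame whose histogram differs
    from the last key frame's histogram by more than `threshold`."""
    if not frames:
        return []
    keys = [0]
    last_hist = gray_histogram(frames[0])
    for i in range(1, len(frames)):
        h = gray_histogram(frames[i])
        if hist_distance(last_hist, h) > threshold:
            keys.append(i)
            last_hist = h
    return keys
```

Real systems use richer features (color, edge, motion) and shot boundaries, but the thresholded-drift idea is a common core of such methods.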
Histogram-based BoW approaches represent the frames of video segments over a vocabulary
of visual features. References [9, 10] are examples of such approaches. Ballan et al. [9]
propose a method interpreting temporal information with the BoW approach. Video events
are conceptualized as vectors composed of histograms of visual features, extracted from the
video frames using the BoW model. These vectors can in fact be treated as sequences, like
strings, in which histograms are considered characters. Classification of these variable-length
sequences, whose length depends on the video scene length, is carried out using
SVM classifiers with a string kernel based on the Needleman-Wunsch edit distance. In Ref. [10],
a feature built from relative motion histograms over Bag-of-Visual-Words (ERMH-BoW) is proposed,
implementing the motion relativity and visual relatedness needed in event detection. Concerning
the ERMH-BoW feature, relative motion histograms are formed between visual words representing
the object activities and events.
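The string-kernel idea above can be sketched as follows: once each frame histogram is quantized to a symbol (e.g., its nearest codeword), two variable-length videos become symbol sequences that a Needleman-Wunsch global alignment can compare. The function names and the match/mismatch/gap scores below are illustrative assumptions, not the values used in Ref. [9]:

```python
from typing import List, Sequence

def needleman_wunsch(seq1: Sequence, seq2: Sequence,
                     match: float = 1.0, mismatch: float = -1.0,
                     gap: float = -1.0) -> float:
    """Global alignment score between two symbol sequences
    (standard Needleman-Wunsch dynamic program)."""
    n, m = len(seq1), len(seq2)
    # dp[i][j]: best score aligning seq1[:i] with seq2[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # substitution/match
                           dp[i - 1][j] + gap,       # gap in seq2
                           dp[i][j - 1] + gap)       # gap in seq1
    return dp[n][m]

def string_kernel(seq1: Sequence, seq2: Sequence) -> float:
    """Alignment-based similarity between two videos' symbol sequences
    (hypothetical length normalization; the paper's kernel may differ)."""
    return needleman_wunsch(seq1, seq2) / max(len(seq1), len(seq2), 1)
```

Such a similarity can be plugged into a precomputed SVM kernel matrix; note that edit-distance-based kernels are not guaranteed positive semi-definite, which such methods must handle.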
Although the above approaches are efficient at the spatial level, they lack flow features
and the temporal semantics of motion, in addition to their performance issues in terms of
time. Motion-based approaches, on the other hand, deal with motion features, which are
important for their strong information content and their stability over spatiotemporal visual
changes. Motion features such as interest points and optical flow are used for modeling
temporal video segments.
References [11-14] are studies using motion features. Ngo et al. [11] propose a new framework
to group similar shots into one scene. Motion characterization and back-