Optical flow-based representation for video action detection - Emerging Trends in Image Processing, Computer Vision, and Pattern Recognition

2 Related work

Different approaches have been followed for representing temporal video segments in content-based video information retrieval problems such as video action recognition, event detection, and cut detection. The studies in Refs. [1-4] focus on the perception of the visual world and, from a more philosophical standpoint, tell us how visual features are detected and in which context. With regard to visual features, the approaches mentioned can be grouped by the way they represent video: key-frame-based, bag-of-words-based (BoW), interest-point-based, and motion-based approaches.

Key-frame-based representation approaches focus on detecting key frames in the video segments in order to use them in classification. This kind of representation is used in Refs. [5-8] for video scene detection and video summarization. In the study of Vasileios et al. [5], videos are segmented into shots, and key frames are extracted from these shots. To avoid the need for prior knowledge of the scene duration, the shots are assigned to visual similarity groups. Each shot is then labeled according to its group, and a sequence alignment algorithm is applied to recognize the patterns in which shot labels change. Shot similarity is estimated using visual features, and shot order is preserved while applying sequence alignment. In Ref. [6], a novel method is presented for automatically annotating images and videos with keywords chosen from among the words representing the concepts or objects needed in content-based image retrieval. A key-frame-based approach is used for videos. Images are represented as vectors of feature vectors containing visual features such as color and edge. They are modeled by a hidden Markov model whose states represent concepts. Model parameters are estimated from a training set. The study proposed in Ref. [7] deals with automatic annotation and retrieval of videos using key frames. It proposes a new approach that automatically annotates video shots with semantic concepts; retrieval is then carried out by textual queries. An efficient method for extracting a semantic candidate set of video shots is presented based on key frames. Extraction uses visual features. In Ref. [8], an innovative algorithm for key frame extraction is proposed. The method is used for video summarization, and metrics are proposed for measuring the quality.
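To make the idea of a key frame concrete, the following is a minimal sketch of one simple selection rule: pick the frame whose color histogram is closest to the mean histogram of its shot. This rule, the L1 distance, and the names `l1_distance` and `select_key_frame` are assumptions made for illustration; the text above does not state how Refs. [5-8] actually extract their key frames.

```python
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def select_key_frame(shot_histograms: List[Sequence[float]]) -> int:
    """Return the index of the frame closest to the shot's mean histogram.

    Assumption: each frame is already summarized as a histogram with the
    same number of bins. This is an illustrative rule, not the method of
    any cited study.
    """
    if not shot_histograms:
        raise ValueError("shot must contain at least one frame")
    n_frames = len(shot_histograms)
    n_bins = len(shot_histograms[0])
    # Bin-wise mean histogram over all frames of the shot.
    mean = [sum(h[b] for h in shot_histograms) / n_frames for b in range(n_bins)]
    distances = [l1_distance(h, mean) for h in shot_histograms]
    return distances.index(min(distances))


# Toy shot: three frames with 3-bin normalized histograms (made-up values).
shot = [
    [0.8, 0.1, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
]
key_index = select_key_frame(shot)  # the middle frame equals the mean histogram
```

In this toy shot the middle frame coincides with the mean histogram, so it is the one selected. A real system would compute the histograms from decoded frames and might select several key frames per shot.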

Histogram-based BoW approaches represent the frames of the video segments over a vocabulary of visual features. References [9, 10] are examples of such approaches. Ballan et al. [9] propose a method that interprets temporal information with the BoW approach. Video events are conceptualized as vectors composed of histograms of visual features, extracted from the video frames using the BoW model. The vectors can in fact be treated as sequences, like strings, in which the histograms are considered as characters. These sequences differ in length, depending on the video scene length, and are classified using SVM classifiers with a string kernel that uses the Needleman-Wunsch edit distance. In Ref. [10], a new motion feature is proposed, Expanded Relative Motion Histogram of Bag-of-Visual-Words (ERMH-BoW), which captures the motion relativity and visual relatedness needed in event detection. In the ERMH-BoW feature, relative motion histograms are formed between visual words representing the object activities and events.
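The idea of treating a video as a string of histograms and comparing such strings with the Needleman-Wunsch edit distance can be sketched as follows. This is a generic illustration, not the implementation of Ref. [9]. The substitution cost (half the L1 distance), the gap penalty of 1.0, the exponential form of the kernel, and all function names are assumptions.

```python
import math
from typing import List, Sequence

Histogram = Sequence[float]


def histogram_cost(a: Histogram, b: Histogram) -> float:
    """Substitution cost: half the L1 distance (in [0, 1] for normalized histograms)."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def needleman_wunsch_distance(
    s: List[Histogram], t: List[Histogram], gap: float = 1.0
) -> float:
    """Global alignment cost between two histogram sequences (lower is more similar)."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against an empty sequence costs one gap per element.
    for i in range(1, rows):
        d[i][0] = i * gap
    for j in range(1, cols):
        d[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j - 1] + histogram_cost(s[i - 1], t[j - 1]),  # substitute
                d[i - 1][j] + gap,  # skip an element of s
                d[i][j - 1] + gap,  # skip an element of t
            )
    return d[-1][-1]


def string_kernel(s: List[Histogram], t: List[Histogram], gamma: float = 1.0) -> float:
    """Assumed kernel form: exp(-gamma * edit distance)."""
    return math.exp(-gamma * needleman_wunsch_distance(s, t))


# Two toy "videos" of different length, with 2-bin histograms per frame.
video_a = [[1.0, 0.0], [0.0, 1.0]]
video_b = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
distance_ab = needleman_wunsch_distance(video_a, video_b)
```

Because the alignment tolerates insertions and deletions, sequences of different length can be compared directly, which is the property the text highlights. Note that a kernel built from an edit distance in this way is not guaranteed to be positive semidefinite in general, so a practical classifier may need extra care on that point.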

Apart from their performance issues in terms of time, the above approaches lack the flow features and temporal semantics of motion, although they are effective at the spatial level. On the other hand, motion-based approaches deal with motion features, which are important in terms of their strong information content and stability over spatiotemporal visual changes. Motion features such as interest points and optical flow are used for modeling temporal video segments. References [11-14] are studies using motion features. Ngo et al. [11] propose a new framework in order to group similar shots into one scene. Motion characterization and back-
