1 Introduction
Video action recognition is a field of multimedia research that enables us to recognize actions from a number of observations. The observations on video frames depend on the video features derived from different sources. Textual features carry high-level semantic information, but their extraction cannot be fully automated: the recognition depends strongly on textual sources, which are commonly created manually. Audio features, on the other hand, are restricted to a supporting role. Since audio rarely conveys actions conceptually, it can serve only as an additional resource that complements visual and textual information. Visual video features provide the basic information for video events and actions. Although it is difficult to obtain high-level semantics from visual information alone, a convincing way to construct an independent, fully automated video annotation or action recognition model is to use visual information as the central resource. This approach leads us to content-based video information retrieval.
Content-based video information retrieval is the automatic annotation and retrieval of conceptual video items such as objects, actions, and events using the visual content obtained from video frames. There are various methods to extract visual features and use them for different purposes. The feature sets they use range from static image features (pixel values, color histograms, edge histograms, etc.) to temporal visual features (interest point flows, shape descriptors, motion descriptors, etc.). Temporal visual features combine static image features with time information. Representing video with temporal visual features therefore means modeling the visual content along the time dimension, i.e., constructing temporal video information.
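As a concrete illustration of the static features listed above, the sketch below computes a color histogram and a gradient-orientation edge histogram for a single frame. It assumes OpenCV (cv2) and NumPy; the function names and bin counts are illustrative choices, not the feature set used in this work.

```python
import cv2
import numpy as np

def color_histogram(frame_bgr, bins=32):
    """Per-channel color histogram, normalized to sum to 1."""
    hists = [cv2.calcHist([frame_bgr], [c], None, [bins], [0, 256]).ravel()
             for c in range(3)]
    h = np.concatenate(hists)
    return h / max(h.sum(), 1.0)

def edge_histogram(frame_bgr, bins=16):
    """Histogram of gradient orientations, weighted by gradient magnitude."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy, angleInDegrees=True)
    h, _ = np.histogram(ang, bins=bins, range=(0, 360), weights=mag)
    return h / max(h.sum(), 1.0)
```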
We need to represent temporal video information formally in order to develop video action recognition methods. Visual features such as corners and visual interest points of video frames are the basis for constructing our model. These features are used to build a more complicated motion feature, namely optical flow. In this work, we propose a new temporal video segment representation method that formalizes video scenes as temporal information for retrieving video actions. The representation is fundamentally based on the optical flow vectors calculated for frequently selected frames of the video scene. The weighted frame velocity concept is put forward for a whole video scene together with the set of optical flow vectors. The combined representation is used in action-based temporal video segment classification, whose output corresponds to the recognized actions.
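The flow-extraction step can be pictured with the following minimal sketch, which uses OpenCV's dense Farneback flow as a stand-in for the flow computation detailed in Section 4; the sampling stride for selecting frames is an assumed parameter.

```python
import cv2

def segment_flows(video_path, stride=5):
    """Yield dense optical flow fields between frames sampled
    every `stride` frames of a video segment."""
    cap = cv2.VideoCapture(video_path)
    prev, idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is not None:
                # flow[y, x] holds the (dx, dy) displacement of each pixel
                flow = cv2.calcOpticalFlowFarneback(
                    prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                yield flow
            prev = gray
        idx += 1
    cap.release()
```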
The main contribution of this work is the proposed temporal video segment representation method. It is intended as a generic model for temporal video segment classification for action recognition. The representation is based on the optical flow concept and uses the common approach of partitioning optical flow vectors according to their angular features. An angular grouping of optical flow vectors is computed for each selected frame of the video. We propose the novel concept of weighted frame velocity, the velocity of the cumulative angular grouping of a temporal video segment, in order to describe the motion of the segment's frames more precisely. A sketch of this grouping follows.
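The code below is one illustrative reading of the angular grouping and weighted frame velocity: it assumes flow vectors are binned by direction and weighted by magnitude, and that the per-frame histograms are accumulated over the segment. The exact formulation is defined later in the paper, and the bin count here is an arbitrary choice.

```python
import numpy as np

def angular_histogram(flow, n_bins=8):
    """Group flow vectors into angular bins, weighting each
    vector by its magnitude."""
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)
    ang = np.mod(np.arctan2(dy, dx), 2 * np.pi)
    hist, _ = np.histogram(ang, bins=n_bins,
                           range=(0, 2 * np.pi), weights=mag)
    return hist

def weighted_frame_velocity(flows, n_bins=8):
    """Accumulate angular histograms over a segment's selected frames
    and normalize by the frame count: one motion descriptor per segment."""
    total = np.zeros(n_bins)
    count = 0
    for flow in flows:
        total += angular_histogram(flow, n_bins)
        count += 1
    return total / max(count, 1)
```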
The outline of this article is as follows. Section 2 reviews related work. Section 3 discusses the temporal segment representation. Optical flow is described in Section 4, and the optical flow-based segment representation is discussed in Section 5. Section 6 explains how the representation was inspired by cut detection. In Section 7, experiments and results are presented. Finally, Section 8 draws the conclusions.