Recognizing Human Actions by Using
Spatio-temporal Motion Descriptors
Akos Utasi and Andrea Kovacs
Hungarian Academy of Sciences, Computer and Automation Research Institute
Distributed Events Analysis Research Group
Kende u. 13-17. H-1111 Budapest, Hungary
{utasi,andrea.kovacs}@sztaki.hu
Abstract. This paper presents a novel tool for detecting human actions
in stationary surveillance camera videos. In the proposed method there
is no need to detect and track the human body or to detect the spatial or
spatio-temporal interest points of the events. Instead, our method computes
single-scale spatio-temporal descriptors to characterize the action
patterns. Two different descriptors are evaluated: histograms of optical
flow directions and histograms of frame difference gradients. The integral
video method is also presented to improve the performance of the
extraction of these features. We evaluated our methods on two datasets:
a public dataset containing actions of persons drinking and a new dataset
containing stand-up events. According to our experiments, both detectors
are suitable for indoor applications and are robust to practical
problems such as moving background or partial occlusion.
Keywords: Human action recognition, optical flow, frame difference.
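The integral video mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual formulation as a three-dimensional summed-area table over (t, y, x); the authors' own definition is not given in this excerpt. The function names and the `votes` volume (per-pixel contributions to a single histogram bin) are hypothetical and chosen only for illustration. Under this assumption, the sum of one bin over any spatio-temporal box needs eight lookups, independent of the box size.

```python
import numpy as np

def integral_video(volume):
    """Cumulative sum over t, y and x, padded with a zero border at index 0."""
    iv = np.zeros(tuple(s + 1 for s in volume.shape), dtype=np.float64)
    iv[1:, 1:, 1:] = volume.cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)
    return iv

def box_sum(iv, t0, t1, y0, y1, x0, x1):
    """Sum of volume[t0:t1, y0:y1, x0:x1] via 3D inclusion-exclusion."""
    return (iv[t1, y1, x1]
            - iv[t0, y1, x1] - iv[t1, y0, x1] - iv[t1, y1, x0]
            + iv[t0, y0, x1] + iv[t0, y1, x0] + iv[t1, y0, x0]
            - iv[t0, y0, x0])

# Hypothetical data: votes of one histogram bin for 6 frames of 8x10 pixels.
rng = np.random.default_rng(0)
votes = rng.random((6, 8, 10))
iv = integral_video(votes)

fast = box_sum(iv, 1, 5, 2, 7, 3, 9)   # eight lookups
slow = votes[1:5, 2:7, 3:9].sum()      # direct summation, for comparison
```

A histogram with several bins would keep one such integral video per bin, so that each cell of a descriptor grid is filled by one `box_sum` call per bin.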
1 Introduction
In the last decade human action detection and recognition in video streams have
been an active field of research. They can often be a prerequisite for applications
such as visual surveillance, semantic video annotation/indexing and retrieval, or
higher level video analysis. It is still a challenging problem due to the variations
in body size and shape, clothing, or the diverse characteristics (e.g. velocity, gait,
posture) of the actions performed by different actors. The environmental noise
(e.g. illumination change, shadows, occlusion, moving or cluttered background)
also increases the complexity of the problem.
Several methods have been developed for detecting objects (e.g. human body,
face, vehicle) in static images, and some of the concepts have been extended for
recognizing actions in video sequences. Most of these methods rely on sparsely
detected interest points and on features extracted at the locations of these points.
Our approach is also inspired by object detection approaches, but unlike
other methods we do not use interest points; instead, we create a dense grid of
local statistics in a spatio-temporal window of predefined size containing the whole
 