understanding human actions by computers is to obtain robust action recog-
nition under variable illumination, background changes, camera motion and
zooming, viewpoint changes, and partial occlusions. Moreover, the system has
to cope with high intra-class variability: the actions can be performed by
people wearing different clothes and having different postures and sizes.
Two main families of approaches have been proposed and developed in the
literature on human action recognition: holistic and part-based representations.
Holistic representations focus on the whole human body, seeking global
characteristics such as contours or pose. Holistic methods that focus on the
contours of a person usually do not decompose the human body into parts
but consider the overall shape of the body in the analyzed frame.
Efros et al. [7] use cross-correlation between optical flow
descriptors in low-resolution videos. However, subjects must be tracked and
stabilized, and if the background is non-uniform, a figure-ground segmentation
is required. Bobick et al. [4] use motion history images that capture motion
and shape to represent actions. They introduced the global descriptors motion
energy image and motion history image. However, their method depends on
background subtraction. This method has been extended by Weinland et al.
[20]. Shechtman et al. [18] use similarity between space-time volumes, which
allows finding similar dynamic behaviors and actions but cannot handle
large geometric variations between intra-class samples, moving cameras, and
non-stationary backgrounds.
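The motion history image of Bobick et al. [4] mentioned above admits a very compact formulation: each pixel where motion is detected is set to a maximum duration tau, while all other pixels decay linearly toward zero, so recent motion appears brighter than older motion. A minimal sketch with numpy, assuming a binary motion mask obtained externally (e.g. by background subtraction or frame differencing; the function names here are illustrative, not the authors' code):

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau):
    """One update step of a motion history image (MHI).

    Pixels flagged by motion_mask are reset to tau; all other pixels
    decay by 1, clipped at 0.
    """
    return np.where(motion_mask, float(tau), np.maximum(mhi - 1.0, 0.0))

def motion_energy_image(mhi):
    """The motion energy image (MEI) is the binary support of the MHI."""
    return mhi > 0
```

Iterating `update_mhi` over a frame sequence yields a single grayscale template per action, which is why the method inherits the strengths and weaknesses of the underlying background subtraction.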
Motion and trajectories are also commonly used features for recognizing
human actions, and their use can be regarded as pose estimation within
holistic approaches. Ramanan and Forsyth [15] track body parts and then use
the obtained motion trajectories to perform action recognition. In particular,
they track the humans in the sequences using a structured procedure; 3D
body configurations are then estimated and compared to a highly annotated 3D
motion library. Multiple cameras and 4D trajectories are used by Yilmaz et
al. [21] to recognize human actions in videos acquired by uncalibrated and
moving cameras. They propose to extend standard epipolar geometry
to the geometry of dynamic scenes and show the versatility of this method
for recognizing actions in challenging sequences. Ali et al. [3] use trajecto-
ries of the hands, feet and body, modeling the human body from experimental
data as a nonlinear, chaotic system.
Holistic methods may depend on the recording conditions, such as the position
of the pattern in the frame, the spatial resolution, and the relative motion with
respect to the camera, and can be influenced by variations in the background
and by occlusions. These problems can in principle be solved by external
mechanisms (e.g. spatial segmentation, camera stabilization, tracking), but
such mechanisms may be unstable in complex situations and increase the
computational demand.
Part-based representations typically search for Space-Time Interest Points
(STIPs) in the video, apply a robust description of the area around them, and
create a model based on independent features (Bag of Words) or a model