that can also contain structural information. These methods do not require
tracking and stabilization and are often more resistant to clutter, since
only a few parts may be occluded. The resulting features often reflect interesting
patterns that can be used for a compact representation of video data as
well as for interpretation of spatio-temporal events. Different methods for
detecting STIPs have been proposed, such as [10, 6].
This work is an extension of the work done in [12], comparing several
methods combined with two different kinds of learning methods.
The chapter is organized as follows. In Section 2 we present the
methodology adopted for classification, and in Section 3 we introduce
the LBP and LBP-TOP descriptors on 3D data. Experimental
results on human action recognition are shown and evaluated in Section 4.
Finally, we conclude in Section 5.
2 Methodology
In the following sub-sections we describe our algorithm in detail. In
Sub-section 2.1 we explain the classification scheme of our algorithm. In
Sub-section 2.2 we describe the STIP detection phase, while the feature
description phase is briefly introduced in Sub-section 2.3. Sub-section 2.4
explains the classifiers used for training the system.
2.1 Bag of Words Classification
The methodology used in this work is an extension of the Bag of Words (BoW)
model to video sequences, introduced by Dollar et al. [6]. As a first step,
the Space-Time Interest Points, which are the locations where interesting
motion occurs, are detected in a video sequence using a separable linear
filter. This phase is called Space-Time Interest Point detection. Small
video patches (also called cuboids) are extracted around each STIP. They
represent the local information used to train the system to recognize the
different human actions. Each cuboid is then described using the LBP-TOP
descriptor. This phase is called Space-Time Interest Point description.
The result is a sparse representation of the video sequence as a set of
cuboid descriptors; the original video sequence can be discarded.
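As an illustration of the detection step, the following Python sketch implements a separable linear filter response of the kind used by the detector of Dollar et al. [6]: Gaussian smoothing in space and a quadrature pair of 1-D Gabor filters in time. The function name, default parameters, and threshold are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d, maximum_filter

def detect_stips(video, sigma=2.0, tau=2.5, thresh=1e-4):
    """Sketch of STIP detection with a separable linear filter.

    video: grayscale array of shape (T, H, W) with values in [0, 1].
    Returns (t, y, x) coordinates of local maxima of the response.
    """
    # Spatial Gaussian smoothing only (no smoothing along time).
    smoothed = gaussian_filter(video, sigma=(0.0, sigma, sigma))

    # Temporal quadrature pair of 1-D Gabor filters; omega = 4 / tau.
    half = 2 * int(np.ceil(tau))
    t = np.arange(-half, half + 1)
    omega = 4.0 / tau
    envelope = np.exp(-t**2 / tau**2)
    h_ev = -np.cos(2.0 * np.pi * omega * t) * envelope   # even filter
    h_od = -np.sin(2.0 * np.pi * omega * t) * envelope   # odd filter

    # Response R = (I*g*h_ev)^2 + (I*g*h_od)^2, convolved along time.
    r_ev = convolve1d(smoothed, h_ev, axis=0)
    r_od = convolve1d(smoothed, h_od, axis=0)
    response = r_ev**2 + r_od**2

    # Interest points: local maxima of R above a hand-picked threshold.
    is_peak = (response == maximum_filter(response, size=3)) \
              & (response > thresh)
    return np.argwhere(is_peak)
```

Cuboids are then cut out of the video around each returned (t, y, x) location.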
In the training phase, a visual vocabulary (also called a codebook) is
built by clustering all the descriptors taken from all training videos. The
clustering is done using the k-means algorithm, and the center of each cluster
is defined as a spatio-temporal 'word' whose length depends on the length of
the descriptor adopted. Each feature descriptor is subsequently assigned to
the closest vocabulary word (we use the Euclidean distance), and a histogram of
spatio-temporal word occurrences is computed for each training video. Thus,
each video is represented as a collection of spatio-temporal words from the
codebook in the form of a histogram. The histograms are the data used to
train the classifiers described in Sub-section 2.4.
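A minimal sketch of the codebook construction and histogram computation follows, assuming the cuboid descriptors are stacked as NumPy arrays; the vocabulary size k = 200, the random seed, and the helper names are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, k=200, seed=0):
    """Cluster all cuboid descriptors from the training videos with
    k-means; the k cluster centers are the spatio-temporal 'words'."""
    all_desc = np.vstack(train_descriptors)       # one row per cuboid
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed)
    kmeans.fit(all_desc)
    return kmeans.cluster_centers_                # shape (k, descr_len)

def video_histogram(descriptors, codebook):
    """Assign each cuboid descriptor of one video to its closest word
    (Euclidean distance) and count word occurrences."""
    # Pairwise squared Euclidean distances, shape (n_cuboids, k).
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(codebook))
```

Each video's histogram then serves as a fixed-length feature vector for the classifiers.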
 