that can also contain structural information. These methods do not require
tracking and stabilization and are often more resistant to clutter, since
only a few parts may be occluded. The resulting features often reflect interesting
patterns that can be used for a compact representation of video data as
well as for interpretation of spatio-temporal events. Different methods for
detecting STIPs have been proposed, such as [10, 6].
This work is an extension of the work done in [12], comparing several
methods combined with two different kinds of learning methods.
The chapter is organized as follows. In Section 2 we present the
methodology adopted for classification, and in Section 3 we introduce
the LBP and LBP-TOP descriptors on 3D data. Experimental
results on human action recognition are shown and evaluated in Section 4.
Finally, we conclude in Section 5.
2 Methodology
In the following sub-sections we describe our algorithm in detail. In
Sub-section 2.1 we explain the classification scheme of our algorithm. In
Sub-section 2.2 we describe the STIP detection phase, while the feature
description phase is briefly introduced in Sub-section 2.3. Sub-section 2.4
explains the classifiers used for training the system.
2.1 Bag of Words Classification
The methodology used in this work is an extension of the Bag of Words (BoW)
model to video sequences, introduced by Dollar et al. [6]. As a first step,
the Space-Time Interest Points, which are the locations where interesting
motion occurs, are detected in a video sequence using a separable linear
filter. This phase is called Space-Time Interest Point detection. Small
video patches (also called cuboids) are extracted around each STIP. They
represent the local information used to train the system to recognize the
different human actions. Each cuboid is then described using the LBP-TOP
descriptor. This phase is called Space-Time Interest Point description.
The result is a sparse representation of the video sequence as a set of
cuboid descriptors; the original video sequence can be discarded.
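As an illustration of the detection step, the following Python sketch implements a separable linear filter response of the kind used by the detector of Dollar et al. [6]: Gaussian smoothing in space and a quadrature pair of 1-D Gabor filters in time. The function name, default parameters, and threshold are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d, maximum_filter

def detect_stips(video, sigma=2.0, tau=2.5, thresh=1e-4):
    """Sketch of STIP detection with a separable linear filter.

    video: grayscale array of shape (T, H, W) with values in [0, 1].
    Returns (t, y, x) coordinates of local maxima of the response.
    """
    # Spatial Gaussian smoothing only (no smoothing along time).
    smoothed = gaussian_filter(video, sigma=(0.0, sigma, sigma))

    # Temporal quadrature pair of 1-D Gabor filters; omega = 4 / tau.
    half = 2 * int(np.ceil(tau))
    t = np.arange(-half, half + 1)
    omega = 4.0 / tau
    envelope = np.exp(-t**2 / tau**2)
    h_ev = -np.cos(2.0 * np.pi * omega * t) * envelope   # even filter
    h_od = -np.sin(2.0 * np.pi * omega * t) * envelope   # odd filter

    # Response R = (I*g*h_ev)^2 + (I*g*h_od)^2, convolved along time.
    r_ev = convolve1d(smoothed, h_ev, axis=0)
    r_od = convolve1d(smoothed, h_od, axis=0)
    response = r_ev**2 + r_od**2

    # Interest points: local maxima of R above a hand-picked threshold.
    is_peak = (response == maximum_filter(response, size=3)) \
              & (response > thresh)
    return np.argwhere(is_peak)
```

Cuboids are then cut out of the video around each returned (t, y, x) location.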
In the training phase, a visual vocabulary (also called a codebook) is
built by clustering all the descriptors taken from all training videos. The
clustering is done using the k-means algorithm, and the center of each cluster
is defined as a spatio-temporal 'word' whose length depends on the length of
the descriptor adopted. Each feature descriptor is subsequently assigned to
the closest vocabulary word (we use the Euclidean distance), and a histogram of
spatio-temporal word occurrences is computed for each training video. Thus,
each video is represented as a collection of spatio-temporal words from the
codebook in the form of a histogram. The histograms are the data used to
train the classifiers described in Sub-section 2.4.
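A minimal sketch of the codebook construction and histogram computation follows, assuming the cuboid descriptors are stacked as NumPy arrays; the vocabulary size k = 200, the random seed, and the helper names are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, k=200, seed=0):
    """Cluster all cuboid descriptors from the training videos with
    k-means; the k cluster centers are the spatio-temporal 'words'."""
    all_desc = np.vstack(train_descriptors)       # one row per cuboid
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed)
    kmeans.fit(all_desc)
    return kmeans.cluster_centers_                # shape (k, descr_len)

def video_histogram(descriptors, codebook):
    """Assign each cuboid descriptor of one video to its closest word
    (Euclidean distance) and count word occurrences."""
    # Pairwise squared Euclidean distances, shape (n_cuboids, k).
    dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(codebook))
```

Each video's histogram then serves as a fixed-length feature vector for the classifiers.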
 