Fusion of Motion and Appearance for Robust People Detection in Cluttered Scenes - Intelligent Video Event Analysis and Understanding - page 113


2 Methodology

In contrast to video sequences captured at full frame rate in well-controlled environments, our people detection task must work in a highly cluttered public scene (underground) given low-resolution data, often at a low frame rate. The scene also suffers from (1) significant lighting changes, which make motion estimation unstable and noisy; (2) heavy occlusions, which require the people detector to handle partial matches; and (3) extensive background clutter, which can cause a high false alarm rate. To this end, we propose a robust people detection method for video sequences that fuses a static appearance-based detector with a long-term motion-based spatial pyramid likelihood measure. An overview of our method is shown in Fig. 1.

[Figure 1 block labels: Image sequences; Appearance (Sliding Window, HOG Descriptor, Linear SVM); Motion (Background Modeling, Differencing, Motion Modeling, Pyramid); Bayesian Verification; Person / non-person]

Fig. 1 Flow chart of our method for pedestrian detection. An appearance-based detector is used to create the initial hypotheses, and long-term motion is modeled by the motion pyramid approach. These cues are combined in a Bayesian framework. The final candidates are selected by thresholding.
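The combination step named in the caption can be illustrated with a generic sketch. The fusion formula itself is not given in this part of the text, so everything below is an assumption made for illustration: the logistic mapping from SVM score to probability, the conditional independence of the two cues, and the names `svm_score_to_prob`, `fuse`, and `select_candidates` are all hypothetical and are not the authors' formulation.

```python
import math

def svm_score_to_prob(score, slope=1.0):
    """Assumed logistic mapping from a raw SVM score to P(person | appearance)."""
    return 1.0 / (1.0 + math.exp(-slope * score))

def fuse(p_appearance, lik_motion_person, lik_motion_background):
    """Posterior P(person | appearance, motion).

    Uses the appearance probability as the prior and updates it with the
    motion likelihoods, assuming the two cues are conditionally independent.
    """
    num = p_appearance * lik_motion_person
    den = num + (1.0 - p_appearance) * lik_motion_background
    return num / den if den > 0.0 else 0.0

def select_candidates(hypotheses, threshold=0.5):
    """Keep hypotheses whose fused posterior reaches the threshold.

    Each hypothesis is (window, svm_score, lik_motion_person, lik_motion_background).
    """
    kept = []
    for window, score, lik_p, lik_b in hypotheses:
        posterior = fuse(svm_score_to_prob(score), lik_p, lik_b)
        if posterior >= threshold:
            kept.append((window, posterior))
    return kept
```

In this sketch, a window with a weak appearance score can still be kept when its motion likelihood strongly favors a person, and a confident appearance detection on static clutter can be rejected, which is the qualitative behavior the fusion is meant to provide.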

2.1 Generating Hypothesis

We adopt the static people detector proposed by Dalal and Triggs [3] to generate static human presence hypotheses in each frame. To achieve scale invariance, this detector uses a multi-scale sliding window approach, i.e. it scans each frame at each scale level. Each sub-window image patch centered at location $i$ (denoted by $v_i$, where $i = 1:n$ and $n$ is the number of patches) is transformed into a feature vector before being classified as either human foreground or scene background by a classifier. The feature vector used here is a SIFT-like [16] feature based on histograms of gradient orientation. The basic idea is that local object appearance and shape can often be characterized rather well by the distribution of local intensity gradients or edge directions, even without precise knowledge of the corresponding gradient or edge positions (similar work in [21] uses histograms of scale-normalized, oriented derivatives to detect and recognize arbitrary object classes). The size of the detection window is $32 \times 64$, including 8 pixels of margin beyond the window size. A linear SVM is used as the classifier and the output of the classifier
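The hypothesis-generation pipeline described in this section (multi-scale sliding window, gradient-orientation histogram feature, linear SVM score) can be sketched as follows. This is a simplified illustration, not the Dalal and Triggs implementation: the cell size, bin count, scale set, stride, nearest-neighbour resizing, and the omission of block normalization are all assumptions, and the weight vector `w` and bias `b` are placeholders for a trained classifier.

```python
import numpy as np

WIN_W, WIN_H = 32, 64  # detection window size stated in the text
CELL = 8               # assumed cell size (not given in the text)
BINS = 9               # assumed number of orientation bins

def hog_descriptor(patch):
    """Simplified HOG: per-cell histograms of unsigned gradient orientation."""
    patch = patch.astype(np.float64)
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation
    bin_idx = np.minimum((ang / np.pi * BINS).astype(int), BINS - 1)
    feats = []
    for y in range(0, WIN_H, CELL):
        for x in range(0, WIN_W, CELL):
            # Magnitude-weighted orientation histogram for one cell.
            hist = np.bincount(
                bin_idx[y:y + CELL, x:x + CELL].ravel(),
                weights=mag[y:y + CELL, x:x + CELL].ravel(),
                minlength=BINS,
            )
            feats.append(hist)
    v = np.concatenate(feats)
    return v / (np.linalg.norm(v) + 1e-6)  # global L2 normalization

def rescale(frame, scale):
    """Nearest-neighbour resize, a stand-in for proper interpolation."""
    h, w = frame.shape
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return frame[np.ix_(rows, cols)]

def generate_hypotheses(frame, w, b, scales=(1.0, 0.75, 0.5), stride=8):
    """Score every window at every scale with a linear SVM.

    Returns (x, y, width, height, score) tuples in original-frame coordinates.
    """
    hyps = []
    for s in scales:
        img = rescale(frame, s)
        img_h, img_w = img.shape
        for y in range(0, img_h - WIN_H + 1, stride):
            for x in range(0, img_w - WIN_W + 1, stride):
                v = hog_descriptor(img[y:y + WIN_H, x:x + WIN_W])
                score = float(w @ v + b)  # linear SVM decision value
                hyps.append((x / s, y / s, WIN_W / s, WIN_H / s, score))
    return hyps
```

With these assumed settings each window yields a 288-dimensional descriptor (8 × 4 cells × 9 bins). Calling `generate_hypotheses(frame, w, b)` on a grayscale frame returns one scored hypothesis per window; downscaling the frame while keeping the window fixed is what lets a single 32 × 64 template match larger people.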
