Fusion of Motion and Appearance for Robust People Detection in Cluttered Scenes - Intelligent Video Event Analysis and Understanding

work in human detection (see [7] and [2] for a survey). This work can be broadly

categorized into two groups: static and dynamic people detectors. Static people

detectors rely mainly on finding robust appearance features that allow the human form to

be discriminated from a cluttered background using a classifier such as SVM or

AdaBoost that searches through a set of sub-images with a sliding window. Typical

features include rectified Haar wavelets [17], rectangular features [23], and SIFT-like

(Scale-Invariant Feature Transform) features such as histograms of oriented gradients

[16, 3]. Papageorgiou et al. [17] described a pedestrian detector based on SVM using

Haar wavelet features. Gavrila and Philomin [6] presented a real-time pedestrian

detection system that uses silhouette information extracted from edge images.

The candidate silhouette is selected as the one with the smallest chamfer

distance to a set of learned human shape examples. On the other hand, there has been little

progress on dynamic detectors, although the idea of using pure motion information

for human pattern recognition is not new [11, 9, 20]. Most existing work utilises

optic flow. Viola et al. [23] proposed a very efficient detector using AdaBoost that

can achieve real-time performance. The rather simple rectangular features and the

cascade structure account for the efficiency of this approach. Motion information

was also taken into account through a coarse estimation of optic flow between two

consecutive frames. Similar work using optic flow for people detection can be

found in [4]. To achieve satisfactory performance, this approach assumes that the

human motion information in the test sequences is similar to that in the training

set. Other related work using motion information includes human behavior recog-

nition by distribution of 3D spatial-temporal interest points [22, 13], 3D volumetric

features [12], or through 3D correlation [1]. Overall, existing methods for computing

motion mostly assume that the motion is locally smooth. However, this does not hold,

especially in busy public scenes, where measuring optic flow is sensitive to noise and

unreliable due to lighting changes, reflections, and moving background such as tree leaves

(see Fig. 2).
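
The chamfer matching idea mentioned above for silhouette-based detection [6] can be illustrated with a minimal sketch. It is not the implementation from [6]: the function names, the toy point sets, and the brute-force nearest-neighbour search are assumptions made for illustration. A practical system would typically precompute a distance transform of the edge image instead of comparing every pair of points.

```python
import math


def chamfer_distance(template_points, edge_points):
    """Mean distance from each template point to its nearest edge point.

    Both arguments are lists of (x, y) tuples. This is the one-directional
    (template to image) form of the chamfer distance.
    """
    if not template_points or not edge_points:
        raise ValueError("both point sets must be non-empty")
    total = 0.0
    for tx, ty in template_points:
        # Brute-force nearest neighbour; fine for a toy example only.
        total += min(math.hypot(tx - ex, ty - ey) for ex, ey in edge_points)
    return total / len(template_points)


def best_template(templates, edge_points):
    """Return (name, distance) for the template with the smallest chamfer distance."""
    scored = [(chamfer_distance(pts, edge_points), name)
              for name, pts in templates.items()]
    distance, name = min(scored)
    return name, distance


# Hypothetical toy data: edge pixels of an image and two shape templates.
edges = [(0, 0), (0, 1), (0, 2), (1, 2)]
templates = {
    "upright": [(0, 0), (0, 1), (0, 2)],
    "shifted": [(3, 0), (3, 1), (3, 2)],
}
name, distance = best_template(templates, edges)
print(name, distance)
```

In this toy case every point of the "upright" template coincides with an edge pixel, so that template is selected; the "shifted" template lies further from the edges and scores a larger distance.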

To date, work on utilising both motion and appearance information remains in its

infancy. To the best of our knowledge, there is little work performing direct people detection

using both appearance and long-term motion information, although our previous

work [25] has shown some promising detection results using a long-term motion score.

In this work, we present a robust framework for people detection in highly cluttered

public scenes that uses both human appearance and long-term motion information

when reliable optic flow cannot be estimated. We further introduce a spatial

pyramid Gaussian Mixture approach to effectively model the variations of long-term

motion information, which takes local geometric constraints into account and

shows slightly better results than using the pure motion score alone [25]. Our method does

not require the estimation of continuous motion such as optic flow in training, and thus

reduces the number of features required for training a classifier. It allows any

detected appearance hypothesis to be verified using long-term motion history analysis.

We show experimental results to demonstrate the efficiency and robustness of

the proposed approach against that of a state-of-the-art static people detector.
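
To make the role of the spatial pyramid more concrete, the following sketch shows only the partitioning step: a window of motion scores is divided into progressively finer grids, and one statistic is computed per cell, so that the resulting feature vector retains coarse geometric layout. This is an assumption-laden illustration, not the method of this chapter: the score map, the number of levels, and the use of a per-cell mean are all hypothetical, and the Gaussian mixture modelling itself is not shown.

```python
def cell_mean(score_map, top, left, bottom, right):
    """Mean of score_map over rows [top, bottom) and columns [left, right)."""
    values = [score_map[r][c]
              for r in range(top, bottom)
              for c in range(left, right)]
    return sum(values) / len(values)


def spatial_pyramid_features(score_map, levels=2):
    """Concatenate per-cell means over a pyramid of 1x1, 2x2, 4x4, ... grids.

    score_map is a list of equal-length rows of numbers (a hypothetical
    motion-score window). Level k splits the window into 2**k by 2**k cells.
    """
    rows, cols = len(score_map), len(score_map[0])
    finest = 2 ** (levels - 1)
    if rows < finest or cols < finest:
        raise ValueError("window too small for the requested number of levels")
    features = []
    for level in range(levels):
        n = 2 ** level
        for i in range(n):
            for j in range(n):
                top, bottom = i * rows // n, (i + 1) * rows // n
                left, right = j * cols // n, (j + 1) * cols // n
                features.append(cell_mean(score_map, top, left, bottom, right))
    return features


# Hypothetical 4x4 window: motion concentrated in the upper half.
window = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
features = spatial_pyramid_features(window, levels=2)
print(features)
```

The first entry summarises the whole window, while the next four distinguish the upper cells from the lower ones. A single global score would lose this layout, which is the kind of local geometric information the pyramid is meant to preserve.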
