Integrated Pedestrian Detection and Localization Using Stereo Cameras - Digital Signal Processing for In-Vehicle Systems and Safety

samples contain a fixed number of 15,000 patches that are randomly selected from 1,239 person-free images of that data set. Training returns a 3,255-dimensional linear classifier (the dimension of the feature vector of a 70 by 134 image patch).

When a novel image arrives, we slide a window over scales and positions to find hypotheses. For each subwindow, we compute a score as the dot product of the pretrained linear model and the feature vector of the image patch. If the score is larger than the threshold, we take the subwindow as a hypothesis; otherwise, we discard it. Typically, for an image region that is likely to be a pedestrian instance, the scores of the boxes around it will be very high. To eliminate overlapping bounding boxes for the same instance, we perform non-maxima suppression and select only one box per instance.
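The thresholding and suppression steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text does not say which non-maxima suppression variant is used, so a greedy, overlap-based variant is assumed here, and the names `detect`, `non_maxima_suppression`, `iou`, and the `overlap` value of 0.5 are hypothetical.

```python
def dot(w, f):
    """Dot product of the linear model w and a feature vector f."""
    return sum(wi * fi for wi, fi in zip(w, f))


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detect(windows, weights, threshold):
    """Score each (box, feature_vector) pair; keep those above the threshold."""
    hypotheses = []
    for box, features in windows:
        score = dot(weights, features)
        if score > threshold:
            hypotheses.append((box, score))
    return hypotheses


def non_maxima_suppression(hypotheses, overlap=0.5):
    """Greedy NMS (assumed variant): visit boxes by descending score and
    drop any box that overlaps an already kept box by more than `overlap`."""
    kept = []
    for box, score in sorted(hypotheses, key=lambda h: h[1], reverse=True):
        if all(iou(box, k[0]) <= overlap for k in kept):
            kept.append((box, score))
    return kept
```

In this sketch, `windows` stands for the subwindows produced by sliding over scales and positions, each already paired with its feature vector; feature extraction itself is outside the example.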

In this way, we obtain a set of hypotheses, each expected to contain a pedestrian instance and each carrying a bounding box and a classification score. However, the classification score lies in the interval $(-\infty, +\infty)$. Since our graphical model requires a probabilistic input $P(o_i \mid I)$, which should lie in the interval (0, 1), we transform the SVM output into a probability with logistic regression:

$$P = \frac{1}{1 + e^{Ax + B}} \tag{16.4}$$

where $x$ is the classification score output by the dot product, $P$ is the corresponding probability, and $A$ and $B$ are parameters that can be estimated from a collected set of pairs $(x, p)$. Given a novel classification score $x_0$, we take the corresponding $p_0$ as $P(o_i \mid I)$.
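As a rough illustration of Eq. (16.4), the sketch below maps a score to a probability and estimates $A$ and $B$ from collected pairs of scores and target probabilities. The text does not say how the parameters are fitted; plain gradient descent on the cross-entropy loss is assumed here, and the function names, learning rate, and step count are hypothetical.

```python
import math


def platt_probability(x, A, B):
    """Map a raw score x to a probability: P = 1 / (1 + exp(A*x + B))."""
    z = A * x + B
    if z >= 0:
        e = math.exp(-z)  # rewritten form avoids overflow for large z
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def fit_platt(scores, targets, lr=0.1, steps=2000):
    """Estimate A and B by gradient descent on the cross-entropy loss.

    targets are probabilities in [0, 1] (for example 0/1 labels).
    With z = A*x + B, the loss gradient with respect to z is (t - p).
    """
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_A = 0.0
        grad_B = 0.0
        for x, t in zip(scores, targets):
            p = platt_probability(x, A, B)
            grad_A += (t - p) * x
            grad_B += (t - p)
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B
```

With fitted parameters, a new score `x0` is converted by calling `platt_probability(x0, A, B)`. Note that with this sign convention a useful fit has a negative `A`, so that higher scores map to higher probabilities.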

16.4 Localization of Pedestrian Instance

The use of a descriptor-based matching approach to obtain a sparse depth map distinguishes our work from previous studies, which estimate depth densely. Although it provides only a sparse representation of the scene, it is less ambiguous than dense matching, which suffers from occlusion and textureless regions. To keep the depth map from being “too sparse,” we use two different kinds of key points, as in [8], to relate the stereo images (Fig. 16.3).

We extract scale-invariant key points using the Difference-of-Gaussian operator [10] and corner key points using the Harris operator. For the scale-invariant key points, we use a GPU implementation of SIFT to compute their descriptors and match them by Euclidean distance. This implementation benefits from Nvidia's CUDA technology and runs at 25 Hz on 640 by 480 images, which we think is enough for general real-world applications. The corner points are matched over a correlation window by normalized cross-correlation. Using two kinds of key points helps establish sufficient raw correspondences quickly.
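The two matching criteria named above can be sketched in a few lines. This is an illustrative CPU version under stated assumptions, not the GPU pipeline the text describes: descriptors and patches are plain Python lists, matching is a brute-force nearest-neighbor search, and the correlation subtracts the patch means (the zero-mean form of normalized cross-correlation), which the text does not specify. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_descriptors(left, right):
    """For each left descriptor, return (left_index, nearest_right_index)."""
    matches = []
    for i, d in enumerate(left):
        j = min(range(len(right)), key=lambda k: euclidean(d, right[k]))
        matches.append((i, j))
    return matches


def ncc(p, q):
    """Zero-mean normalized cross-correlation of two equal-size patches,
    each flattened to a list of intensities. Result lies in [-1, 1]."""
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    num = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    den = math.sqrt(
        sum((a - mean_p) ** 2 for a in p) * sum((b - mean_q) ** 2 for b in q)
    )
    return num / den if den > 0 else 0.0  # flat patches carry no signal
```

For corner points, one would evaluate `ncc` between the window around a corner in one image and the windows around candidate corners in the other, keeping the candidate with the highest value.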
