Digital Signal Processing Reference
samples consist of a fixed set of 15,000 patches randomly selected from
1,239 person-free images of that data set. Training returns a 3,255-dimensional
linear classifier (the size of the feature vector of a 70 by 134 image patch).
When a novel image arrives, we slide a window over scales and positions to
find the hypotheses. For each subwindow, we compute a score as the dot product
of the pretrained linear model and the feature vector of the image patch. If the
score is larger than the threshold, we take the subwindow as a hypothesis;
otherwise we discard it. Typically, for an image region that is likely to be a
pedestrian instance, the scores of the boxes around it will be very high. To
eliminate overlapping bounding boxes for the same instance, we perform
non-maxima suppression to select only one box for each instance.
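The scoring, thresholding, and suppression steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the feature vectors of all subwindows have already been computed, and it uses a greedy, overlap-based form of non-maxima suppression with an intersection-over-union limit of 0.5. The function names, the default threshold, and the overlap limit are all hypothetical choices that do not come from the text.

```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-union of one box with an array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def detect(boxes, features, w, threshold=0.0, overlap=0.5):
    """Score every subwindow, keep those above the threshold, then suppress overlaps.

    boxes:    (n, 4) array of subwindow coordinates
    features: (n, d) array, one feature vector per subwindow
    w:        (d,) pretrained linear model
    threshold, overlap: assumed values, not taken from the text
    """
    scores = features @ w                      # dot product per subwindow
    keep = scores > threshold                  # hypothesis only if above threshold
    boxes, scores = boxes[keep], scores[keep]
    order = np.argsort(-scores)                # highest score first
    selected = []
    while order.size > 0:
        best = order[0]
        selected.append(best)
        rest = order[1:]
        # drop every remaining box that overlaps the current best too much
        order = rest[iou(boxes[best], boxes[rest]) <= overlap]
    selected = np.array(selected, dtype=int)
    return boxes[selected], scores[selected]
```

With two strongly overlapping high-scoring boxes around one pedestrian, only the higher-scoring box survives, while a distant box above the threshold is kept as a separate hypothesis.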
In this way, we get a set of hypotheses, each of which is expected to contain a
pedestrian instance and comes with a bounding box and a classification score.
However, the classification score lies in the interval $(-\infty, +\infty)$.
Since our graphical model wants a probabilistic input $P(o_i \mid I)$, which
should be in the interval (0, 1), we transform the SVM output into a probability
form with logistic regression:

$$P = \frac{1}{1 + e^{Ax + B}} \tag{16.4}$$

where x is the classification score output from the dot product, P is the
corresponding probability form of the score, and A and B are parameters which
could be estimated by collecting a set of x and p. With a novel classification
score $x_0$, we take the corresponding $p_0$ as $P(o_i \mid I)$.
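One way to estimate A and B from collected pairs of x and p is sketched below. It relies on an assumption not stated in the text: the collected p values are empirical probabilities strictly between 0 and 1 (for example, the fraction of true pedestrians among detections with similar scores). Under that assumption, Eq. (16.4) rearranges to log(1/p - 1) = Ax + B, which is a straight line in x and can be fitted by least squares. The function names are hypothetical; with binary labels instead of probabilities, a maximum-likelihood fit would be needed instead.

```python
import numpy as np

def sigmoid_prob(x, A, B):
    """Eq. (16.4): map a classification score x to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(A * x + B))

def fit_sigmoid(x, p):
    """Least-squares estimate of A and B from paired scores x and probabilities p.

    Assumes every p lies strictly inside (0, 1), so that the rearranged form
    log(1/p - 1) = A*x + B is well defined.
    """
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    target = np.log(1.0 / p - 1.0)       # left-hand side of the rearranged equation
    A, B = np.polyfit(x, target, 1)      # slope is A, intercept is B
    return A, B

# Usage: fit on collected (x, p) pairs, then convert a novel score x0 to p0.
x = np.linspace(-2.0, 2.0, 9)
p = sigmoid_prob(x, -2.0, 0.5)           # synthetic pairs for illustration only
A, B = fit_sigmoid(x, p)
p0 = sigmoid_prob(0.3, A, B)
```

Note that A comes out negative when higher scores correspond to higher probabilities, because the exponent in Eq. (16.4) is written as Ax + B rather than -(Ax + B).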
16.4 Localization of Pedestrian Instance
The use of a descriptor-based matching approach to obtain a sparse depth map
distinguishes our work from previous studies, which estimate depth in a
dense way. Though it provides only a sparse representation of the scene, it is
less ambiguous than dense matching, which suffers from occlusion and textureless
regions. To make the depth map not "too sparse," we use two different kinds of key
points as in [8] to relate the stereo images (Fig. 16.3).
We extract scale-invariant key points using the Difference-of-Gaussian operator [10]
and corner key points using the Harris operator. For the scale-invariant key points, we
utilize a GPU implementation of SIFT to compute their descriptors and match them
by measuring the Euclidean distance. This implementation benefits from
Nvidia's CUDA technology and reaches 25 Hz when processing images
of size 640 by 480, which we think is enough for general real-world applications.
The corner points are matched over a correlation window by normalized
cross-correlation. Using two kinds of key points helps establish sufficient raw
correspondences quickly.
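The two matching criteria can be illustrated with a small CPU sketch. It is not the GPU SIFT implementation mentioned above: it assumes the descriptors and the correlation windows have already been extracted, and it uses plain nearest-neighbor matching without any additional rejection test. The function names are hypothetical.

```python
import numpy as np

def match_by_euclidean(desc_left, desc_right):
    """For each left descriptor, find the nearest right descriptor by Euclidean distance.

    desc_left:  (n, d) array of descriptors from the left image
    desc_right: (m, d) array of descriptors from the right image
    Returns the index of the nearest right descriptor and its distance.
    """
    diff = desc_left[:, None, :] - desc_right[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))      # (n, m) distance matrix
    return dist.argmin(axis=1), dist.min(axis=1)

def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized windows, in [-1, 1]."""
    a = patch_a.astype(float) - patch_a.mean()
    b = patch_b.astype(float) - patch_b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    if denom == 0:                               # flat window: correlation undefined
        return 0.0
    return float((a * b).sum() / denom)
```

Because both windows are mean-centered and normalized, the correlation score is unaffected by a gain or offset change in brightness between the two stereo images, which is one reason it suits corner points that carry no descriptor of their own.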