We use texture memory in the kernel "Warping". With the bilinear filtering function provided by CUDA, we only need to set the filter mode parameter to bilinear. When fetching from texture memory, the returned value is automatically computed from the input coordinates with the bilinear filter. This hardware feature saves us from programming the interpolation ourselves.
5.2 Memory Coalescing
By using memory coalescing in CUDA, a half warp of 16 GPU threads can complete 16 global data fetches in as few as 1 or 2 transactions. We have used this optimization technique intensively in our applications. For example, to calculate the mean of I in the ZNCC correlation of a 360 × 360 region, we first need to calculate the sum of I. We use 1 block of 512 threads (the index of each thread, "threadID", runs from 0 to 511) to accumulate all 129600 pixels. As 129600 = 254 × 510 + 60, the number of data processed by each thread is 254 (except the last two threads, which handle only the remaining 60 data). The straightforward idea is to use a for-loop in each GPU thread like this:
    for (j = threadID * 254; j < (threadID + 1) * 254; j++) {
        sum += I[j];
    }
To fully use memory coalescing, we change the code as follows:
    for (j = threadID; j < 129600; j += 512) {
        sum += I[j];
    }
Both for-loops seem to have the same performance from the viewpoint of a single GPU thread. But due to the GPU's particular memory fetching mechanism, a real speedup happens on the GPU.
GPU memory is accessed in a contiguous block mode, i.e., during one GPU memory access, data from a block of contiguously addressed memory space is loaded simultaneously. For example, it can load T[0] to T[15] simultaneously for 16 GPU threads.
In the latter loop, the 16 fetched data can be processed in parallel by 16 GPU threads. Meanwhile, in the former loop, only 1 of these 16 data is used by 1 thread while the other 15 are discarded; each of the other 15 threads must invoke its own GPU memory access to fetch its data. Therefore, for the same amount of data fetching, the former loop costs about 15 times more memory access time than the latter. With the memory coalescing strategy shown in the latter loop, we have substantially reduced the total running time.
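To make the whole computation concrete, here is a minimal sketch (our own illustration; the names blockSum and partial are assumptions) of how 1 block of 512 threads can accumulate the 129600 pixels with the coalesced stride-512 loop and then combine the per-thread partial sums with a standard shared-memory tree reduction:

    #include <cuda_runtime.h>

    #define N_PIXELS  129600   // 360 x 360 region
    #define N_THREADS 512

    // One block of 512 threads sums all 129600 pixel values.
    __global__ void blockSum(const float *I, float *result)
    {
        __shared__ float partial[N_THREADS];
        int threadID = threadIdx.x;

        // Coalesced accumulation: in each iteration, threads 0..511 read
        // 512 consecutive addresses, so the loads combine into a few
        // memory transactions.
        float sum = 0.0f;
        for (int j = threadID; j < N_PIXELS; j += N_THREADS)
            sum += I[j];
        partial[threadID] = sum;
        __syncthreads();

        // Standard tree reduction of the per-thread partial sums.
        for (int s = N_THREADS / 2; s > 0; s >>= 1) {
            if (threadID < s)
                partial[threadID] += partial[threadID + s];
            __syncthreads();
        }
        if (threadID == 0)
            *result = partial[0];  // divide by N_PIXELS to get the mean
    }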
6 Conclusions
In this paper, an efficient approach combining GPU-ESM and GPU-SIFT is presented. Experimental results verified the efficiency and effectiveness of our approach. The optimization techniques in our implementations are presented as a reference for other GPU application developers.