We use texture memory in the kernel "Warping". With the bilinear filtering function provided by CUDA, we only need to set the filter mode parameter to bilinear. When fetching from texture memory, the returned value is automatically computed from the input coordinates with the bilinear filter. This hardware feature saves us from programming the interpolation ourselves.
5.2 Memory Coalescing
By using memory coalescing in CUDA, a half warp of 16 GPU threads can complete 16 global data fetches in as few as 1 or 2 transactions. We have used this optimization technique intensively in our applications. For example, to calculate the mean of I in the ZNCC correlation of a 360 × 360 region, we first need to calculate the sum of I. We use 1 block of 512 threads (the index of each thread, "threadID", runs from 0 to 511) to accumulate all 129600 pixels. As 129600 = 254 × 510 + 60, the number of data processed by each thread is 254 (except the last two threads, which handle only the remaining 60 data). The straightforward idea is to use a for-loop in each GPU thread like this:
    for (j = threadID * 254; j < (threadID + 1) * 254; j++) {
        sum += I[j];
    }
To fully use memory coalescing, we change the code as follows:
    for (j = threadID; j < 129600; j += 512) {
        sum += I[j];
    }
Both for-loops seem to have the same performance from the viewpoint of a single GPU thread. But due to the GPU's particular memory fetching mechanism, a real speedup happens on the GPU.
GPU memory is accessed in a contiguous block mode, i.e., during one GPU memory access, data from a block of contiguously addressed memory space is loaded simultaneously. For example, it can load T[0] to T[15] simultaneously for 16 GPU threads.
In the latter loop, the 16 fetched data can be processed in parallel by 16 GPU threads. Meanwhile, in the former loop, only 1 of these 16 data is used by 1 thread while the other 15 are discarded; each of the other 15 threads must invoke its own GPU memory access to fetch its data. Therefore, for the same amount of data fetching, the former loop costs about 15 times more memory access time than the latter. With the memory coalescing strategy shown in the latter loop, we have substantially reduced the total running time.
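To make the whole computation concrete, here is a minimal sketch (our own illustration; the names blockSum and partial are assumptions) of how 1 block of 512 threads can accumulate the 129600 pixels with the coalesced stride-512 loop and then combine the per-thread partial sums with a standard shared-memory tree reduction:

    #include <cuda_runtime.h>

    #define N_PIXELS  129600   // 360 x 360 region
    #define N_THREADS 512

    // One block of 512 threads sums all 129600 pixel values.
    __global__ void blockSum(const float *I, float *result)
    {
        __shared__ float partial[N_THREADS];
        int threadID = threadIdx.x;

        // Coalesced accumulation: in each iteration, threads 0..511 read
        // 512 consecutive addresses, so the loads combine into a few
        // memory transactions.
        float sum = 0.0f;
        for (int j = threadID; j < N_PIXELS; j += N_THREADS)
            sum += I[j];
        partial[threadID] = sum;
        __syncthreads();

        // Standard tree reduction of the per-thread partial sums.
        for (int s = N_THREADS / 2; s > 0; s >>= 1) {
            if (threadID < s)
                partial[threadID] += partial[threadID + s];
            __syncthreads();
        }
        if (threadID == 0)
            *result = partial[0];  // divide by N_PIXELS to get the mean
    }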
6 Conclusions
In this paper, an efficient approach combining GPU-ESM and GPU-SIFT is presented. Experimental results verified the efficiency and effectiveness of our approach. The optimization techniques in our implementations are presented as a reference for other GPU application developers.