Kernels of the stream H.264 encoder are executed on the GPU in the form of thread blocks, where each thread executes one iteration of a kernel. Subblock-level and slice-level parallelization techniques are applied across all threads to match the processing granularity and degree of parallelism of the GPU. This paper takes the SAD-computing kernel of the ME module as an example to illustrate how to map the stream code onto the GPU, as shown in Figure 3.
In this example, the numbers of threads and thread blocks executed simultaneously on the GPU can be calculated with the following formulas (the thread block size is set to 256 threads based on experience, the subblock size is 4*4, and different rows of macroblocks in a slice cannot be processed in parallel):
Thread_nums = (Frame_width/4) * (MB_height/4) * N * 32 * 32   (1)
ThreadBlock_nums = Thread_nums/256   (2)
where N is the number of slices in an image and 32*32 is the size of the search window.
Fig. 3. Mapping of kernel SAD computing on GPU
Obviously, the number of threads given by formula (1) is the number of subblock SAD computations that can be processed in parallel for one image. For a full-high-definition video sequence (1920*1080 resolution), the numbers of threads and thread blocks executed simultaneously on the GPU reach 1966080*N and 7680*N, respectively. Since there are 7 SMs on the GTX460, each SM can be assigned hundreds or even thousands of thread blocks. The number of active thread blocks on an SM is set to 4 at a time (the total number of active threads on an SM is at most 1536). Thus there are abundant thread resources to schedule, which efficiently hides memory access latency and keeps the SPs of the GPU fully loaded.
It can be seen that the data of the reference frame and the original frame are loaded into global memory first. Then each original subblock and its corresponding 256