Kernels of the stream H.264 encoder are executed on the GPU in the form of thread blocks, where each thread executes one iteration of a kernel. Subblock-level and slice-level parallelization techniques are applied across all threads to match the processing granularity and degree of parallelism of the GPU. This paper takes the SAD-computing kernel of the ME module as an example to illustrate how to map the stream code onto the GPU, as shown in Figure 3.
In this example, the numbers of threads and thread blocks executed simultaneously on the GPU can be calculated with the following formulas (the thread block size is set to 256 threads based on experience, the subblock size is 4*4, and different rows of macroblocks in a slice cannot be processed in parallel):
Thread_nums = (Frame_width/4) * (MB_height/4) * N * 32 * 32   (1)
ThreadBlock_nums = Thread_nums/256   (2)
where N is the number of slices in an image and 32*32 is the size of the search window.
Fig. 3. Mapping of kernel SAD computing on GPU
Obviously, the number of threads given by formula (1) is the number of subblock SAD computations that can be processed in parallel for one image. For a full-high-definition video sequence (1920*1080 resolution), the numbers of threads and thread blocks executed simultaneously on the GPU reach 1966080*N and 7680*N, respectively. Since there are 7 SMs on the GTX460, each SM can be assigned hundreds or even thousands of thread blocks. The number of active thread blocks on an SM is set to 4 at a time (the total number of active threads on an SM is at most 1536). Thus there are abundant thread resources to schedule, which efficiently hides memory access latency and keeps the SPs of the GPU fully loaded.
It can be seen that the data of the reference frame and the original frame are loaded into global memory first. Then each original subblock and its corresponding 256