High-Performance Implementation of Stream Model Based H.264 Video Coding on Parallel Processors - Multimedia and Signal Processing

candidates are loaded into shared memory in stages and processed by a thread block of the SAD computing kernel. Each thread (Ti) computes the SAD between one candidate and the original subblock. This significantly reduces global memory accesses, within the limits of on-chip memory capacity. Unlike the LRF on Storm, shared memory is not purely software-managed. As one of the data providers for a large number of threads (which can also access registers and global memory), it can be used by defining shared memory groups, but there is no way to know which thread blocks are accessing it at any given time. Performance optimization of the H.264 encoder on a GPU therefore remains very difficult.
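The per-candidate SAD computation described above can be illustrated with a minimal sequential sketch. It is written in Python for readability and is not the authors' GPU kernel: each iteration of the candidate loop stands in for the work one thread Ti would do, and the function names, the square block shape, and the exhaustive search window are all assumptions made for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def extract_block(frame, top, left, size):
    """Copy a size x size block whose top-left corner is at (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_candidate(original, reference, top, left, size, search_range):
    """Return (cost, dy, dx) of the candidate with the lowest SAD.

    Each (dy, dx) pair is one candidate; on a GPU, each candidate would be
    handled by its own thread instead of by one loop iteration.
    Hypothetical helper, not taken from the source.
    """
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(original, extract_block(reference, y, x, size))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best
```

In a GPU version, the candidates and the original subblock would be staged in shared memory so that all threads of a block read them from on-chip storage rather than from global memory, which is the saving the text refers to.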

4 Results and Discussion

As shown in Table 1, the evaluation is performed on five kinds of programmable processors: a desktop CPU, an embedded CPU, a DSP, a stream processor, and a GPU.

Table 1. Experimental platforms' configurations

Table 2. Encoding video quality before and after streaming

Correctness and Quality. On each platform, both X264 and the stream code of H.264 encode three high-definition video sequences (only the stream code can run on the GPU). A standard H.264 decoder in VLC media player then decodes the encoded bit streams directly to verify correctness. Comparing the two decoded results shows that the output bit streams of our stream code are correct. Table 2 shows the detailed results.
