Digital Signal Processing Reference
candidates are loaded into the shared memory gradually and processed by a thread
block of the SAD computing kernel. Each thread (Ti) computes the SAD between one
candidate and an original subblock. This significantly reduces global memory accesses,
within the limit of the on-chip memory capacity. Unlike the LRF on Storm, the shared
memory is not a purely software-managed memory. As one of the data providers for a
large number of threads (which can also access registers and global memory), it can be
used by defining shared memory groups, but it is not known which thread blocks are
accessing it at any given time. Thus performance optimization of the H.264 encoder on
a GPU is still very difficult.
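The per-thread SAD computation described above can be sketched as follows. This is a minimal sequential illustration in Python, not the GPU kernel itself: each loop iteration stands in for one thread Ti, and the list of extracted candidate blocks stands in for data staged in shared memory. The function names (`sad`, `extract_block`, `best_candidate`) and the toy frame are assumptions made for illustration only.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def extract_block(frame, top, left, size):
    """Copy a size x size block out of a frame (stand-in for a shared-memory load)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_candidate(original, reference, candidates, size):
    """Return (index of best candidate, list of all SADs).

    `candidates` is a list of (top, left) positions in `reference`.
    Each list element is handled the way one thread Ti would handle it:
    one candidate block compared against the same original subblock.
    """
    sads = [
        sad(original, extract_block(reference, top, left, size))
        for (top, left) in candidates
    ]
    best = min(range(len(sads)), key=sads.__getitem__)
    return best, sads


# Toy 4x4 reference frame and a 2x2 original subblock (hypothetical data).
reference = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
original = [[6, 7], [10, 11]]
candidates = [(0, 0), (0, 1), (1, 0), (1, 1)]

best, sads = best_candidate(original, reference, candidates, size=2)
print(best, sads)
```

On a GPU, the candidate blocks would be loaded once into on-chip memory and read by all threads of the block, which is where the reduction in global memory accesses comes from; the sketch only reproduces the arithmetic and the one-candidate-per-thread mapping.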
4 Results and Discussion
As shown in Table 1, the evaluation is performed on five kinds of programmable
processors: a desktop CPU, an embedded CPU, a DSP, a stream processor, and a GPU.
Table 1. Experimental platforms' configurations
Table 2. Encoding video quality before and after streaming
Correctness and Quality. On each platform, both X264 and the stream code of
H.264 encode three high-definition video sequences (only the stream code can run on
the GPU), and a standard H.264 decoder in the VLC media player decodes the encoded
bit streams directly to verify correctness. Comparison of the two decoded results
shows that the output bit streams of our stream code are correct. Table 2 shows
the detailed results.
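A comparison of two decoded results could be sketched as below. The text does not state which quality metric Table 2 reports, so the use of PSNR here is an assumption, as are the function names and the representation of each decoded frame as a `bytes` object of 8-bit samples.

```python
import math


def mse(frame_a, frame_b):
    """Mean squared error between two frames given as equal-length byte strings."""
    if len(frame_a) != len(frame_b) or len(frame_a) == 0:
        raise ValueError("frames must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)


def psnr(frame_a, frame_b, peak=255):
    """PSNR in dB for 8-bit samples; infinite when the frames are identical."""
    err = mse(frame_a, frame_b)
    if err == 0:
        return math.inf
    return 10 * math.log10(peak * peak / err)


def compare_decoded(frames_a, frames_b):
    """Per-frame PSNR between two decoded sequences of the same length."""
    if len(frames_a) != len(frames_b):
        raise ValueError("sequences must contain the same number of frames")
    return [psnr(a, b) for a, b in zip(frames_a, frames_b)]


# Hypothetical decoded frames: the first pair is identical, the second differs.
seq_a = [bytes([10, 20, 30, 40]), bytes([0, 0, 0, 0])]
seq_b = [bytes([10, 20, 30, 40]), bytes([255, 255, 255, 255])]
print(compare_decoded(seq_a, seq_b))
```

An infinite PSNR for a frame means the two decoders produced bit-identical output for it; finite values quantify how far the two decoded results diverge.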