High-Performance Implementation of Stream Model Based H.264 Video Coding on Parallel Processors - Multimedia and Signal Processing - page 422

In other words, the 16 arithmetic lanes process 16 stream elements at a time; each stream element may be a Macroblock or a subblock.

Macroblock Level Parallelism. For the Inter-prediction, Transform Coding, CAVLC, and Deblocking Filter modules of the stream H.264 encoder, the parallel processing granularity is at the Macroblock level.

Fig. 1. Macroblock level parallelism of ME on Storm processor

We take Motion Estimation (ME) as an example to illustrate how a kernel with Macroblock-level parallelism is mapped onto lanes with a DLP degree of 16, as shown in Figure 1. Kernels of the other modules are mapped likewise. Inter-prediction is mainly implemented by three parameterized kernels: SAD Computing, SAD Merging, and Best MV Selection; the data-processing granularity of each kernel is up to 3 MByte. The input stream of the first inter-prediction kernel may be defined as the Macroblock stream of a 16x1920-pixel stripe of a 1920x1080 image. (Stream length is also restricted by the size of on-chip memory: in Storm, all streams processed by a kernel have to be loaded into the LRF manually beforehand.) Therefore, the lanes of Storm can naturally process the Macroblocks of a stream modulo 16 for these kernels, and all intermediate streams are likewise organized modulo 16 in the on-chip memory (LRF).
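The modulo-16 mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Storm programming interface: the function names (`macroblocks_in_stripe`, `lane_schedule`, `sad`, `sad_batch`) are hypothetical, Macroblocks are modeled as plain nested lists, and the lockstep execution of the lanes is simulated by an ordinary loop.

```python
# Hypothetical sketch: all names are illustrative, not part of any Storm API.
LANES = 16  # DLP degree stated in the text
MB = 16     # Macroblock edge length in pixels


def macroblocks_in_stripe(width):
    """Number of 16x16 Macroblocks in one 16-pixel-high stripe of the image."""
    return width // MB


def lane_schedule(num_mbs, lanes=LANES):
    """Group Macroblock indices into batches; Macroblock i runs on lane i % lanes.

    Each batch models one lockstep step of all lanes. None marks an idle lane.
    """
    batches = []
    for start in range(0, num_mbs, lanes):
        batch = [None] * lanes
        for i in range(start, min(start + lanes, num_mbs)):
            batch[i % lanes] = i
        batches.append(batch)
    return batches


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def sad_batch(cur_mbs, ref_mbs, batch):
    """One simulated lockstep step: each active lane computes its own SAD."""
    return [None if i is None else sad(cur_mbs[i], ref_mbs[i]) for i in batch]
```

For a 1920-pixel-wide stripe, `macroblocks_in_stripe(1920)` gives 120 Macroblocks and `lane_schedule(120)` gives 8 batches, the last of which keeps only lanes 0 to 7 busy. Intermediate results returned by `sad_batch` stay in lane order, which mirrors the modulo-16 organization of the intermediate streams.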

Subblock Level Parallelism. For Intra-prediction, the parallel processing granularity is at the 4x4 subblock level. We take kernels such as Predict16x16 as examples to illustrate how kernels with subblock-level parallelism are mapped onto lanes with a DLP degree of 16. The data-processing granularity of the 16x16 Luma Intra-Prediction kernel is up to 16 subblocks. Thus each lane processes one subblock of a Macroblock independently, as shown in Figure 2(a).
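As a rough illustration of the one-subblock-per-lane mapping, the sketch below splits a 16x16 Macroblock into 16 4x4 subblocks and lets each simulated lane compute a partial cost on its own subblock. Several points are assumptions made for the example rather than details from the text: the raster ordering of subblocks, the use of SAD as the per-lane cost, and the names `split_into_subblocks`, `subblock_sad`, and `predict16x16_cost`.

```python
# Hypothetical sketch: names, subblock order, and SAD cost are assumptions.
LANES = 16  # DLP degree stated in the text
SUB = 4     # subblock edge length in pixels


def split_into_subblocks(mb):
    """Split a 16x16 Macroblock (16 rows of 16 pixels) into 16 4x4 subblocks.

    Subblocks are returned in raster order, so subblock k would go to lane k.
    """
    subs = []
    for by in range(0, 16, SUB):
        for bx in range(0, 16, SUB):
            subs.append([row[bx:bx + SUB] for row in mb[by:by + SUB]])
    return subs


def subblock_sad(src, pred):
    """Sum of absolute differences over one 4x4 subblock."""
    return sum(abs(s - p) for rs, rp in zip(src, pred) for s, p in zip(rs, rp))


def predict16x16_cost(src_mb, pred_mb):
    """Per-lane partial costs and their total for one predicted Macroblock."""
    src_subs = split_into_subblocks(src_mb)
    pred_subs = split_into_subblocks(pred_mb)
    # Each list element stands for the independent work of one lane.
    per_lane = [subblock_sad(s, p) for s, p in zip(src_subs, pred_subs)]
    return per_lane, sum(per_lane)
```

Because no lane reads another lane's subblock in this sketch, all 16 partial costs could be computed at once; only the final sum needs a cross-lane reduction.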

The data-processing granularity of the 4x4 Luma Intra-Prediction kernel is 4n subblocks. These subblocks are processed in a 7-stage procedure using the 16 lanes, as shown in Figure 2(b). All lanes read stream elements (subblocks) in sequence, so that the data stream can be organized conveniently. Since the DLP degree of each stage in a slice is less than 16 and all lanes have to compute in parallel, only specific lanes' results are valid at each stage. The results produced by the last stage are transferred to the next stage by
