Digital Signal Processing Reference
In-Depth Information
In other words, the 16 arithmetic lanes process 16 stream elements at a time; a stream
element may be a Macroblock or a subblock.
Macroblock Level Parallelism. For the Inter-prediction, Transform Coding, CAVLC, and
Deblocking filter modules of the stream H.264 encoder, the parallel processing
granularity is the Macroblock level.
Fig. 1. Macroblock level parallelism of ME on Storm processor
We take Motion Estimation (ME) as an example to illustrate how to map a kernel with
Macroblock-level parallelism onto lanes with a DLP degree of 16, as shown in Figure
1. The kernels of the other modules are mapped likewise. Inter Prediction is mainly
implemented by three parameterized kernels: SAD Computing, SAD Merging, and Best
MVs Select. The data processing granularity of each kernel is up to 3 MByte. The
input stream of the first inter-prediction kernel may be defined as the Macroblock
stream of a 16x1920-pixel stripe of a 1920x1080 image, that is, 120 Macroblocks.
(Stream length is also restricted by the size of on-chip memory: in Storm, all
streams processed by a kernel have to be loaded into the LRF beforehand, and
manually.) Therefore, for these kernels the lanes of Storm naturally process the
Macroblocks of a stream modulo 16, and all intermediate streams are likewise
organized modulo 16 in the on-chip memory (LRF).
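The modulo-16 distribution described above can be sketched as follows. This is only an illustration, assuming that Macroblock i of the stripe is handled by lane i mod 16; the function names are hypothetical and are not part of any Storm programming interface.

```python
# Illustrative sketch only: the names below are assumptions, not a Storm API.
LANES = 16      # DLP degree stated in the text
MB_SIZE = 16    # a Macroblock covers 16x16 pixels


def macroblocks_in_stripe(width):
    """Number of Macroblocks in one 16-pixel-high stripe of the given width."""
    return width // MB_SIZE


def lane_schedule(num_mbs, lanes=LANES):
    """Assign Macroblock i to lane (i mod lanes); return one index list per lane."""
    schedule = [[] for _ in range(lanes)]
    for mb in range(num_mbs):
        schedule[mb % lanes].append(mb)
    return schedule


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel sequences."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


num_mbs = macroblocks_in_stripe(1920)   # stripe of a 1920x1080 image
schedule = lane_schedule(num_mbs)
passes = max(len(mbs) for mbs in schedule)
```

Under this assumption a 1920-pixel stripe needs eight passes over the lanes, and in the final pass only lanes 0 to 7 hold a Macroblock, since 120 is not a multiple of 16.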
Subblock Level Parallelism. For Intra-prediction, the parallel processing granularity
is the 4x4 subblock level. We take the Predict16x16 kernel as an example to illustrate
how to map kernels with subblock-level parallelism onto lanes with a DLP degree of 16.
The data processing granularity of the 16x16 Luma Intra-Prediction kernel is up to 16
subblocks. Thus each lane processes one subblock of a Macroblock independently, as
shown in Figure 2(a).
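A minimal sketch of this one-subblock-per-lane mapping is given here. It assumes that the 16 subblocks are numbered in raster order (the source does not state the ordering), and the per-lane operation, a residual against a single predicted value, is a hypothetical stand-in for the actual prediction kernel.

```python
# Illustrative sketch only: raster ordering and lane_residual are assumptions.
LANES = 16   # DLP degree stated in the text
SUB = 4      # a subblock covers 4x4 pixels


def split_into_subblocks(mb):
    """Split a 16x16 Macroblock (16 rows of 16 pixels) into 16 4x4 subblocks."""
    subblocks = []
    for by in range(0, 16, SUB):
        for bx in range(0, 16, SUB):
            subblocks.append([row[bx:bx + SUB] for row in mb[by:by + SUB]])
    return subblocks


def lane_residual(subblock, predicted_value):
    """Hypothetical per-lane work: subtract one predicted value from each pixel."""
    return [[p - predicted_value for p in row] for row in subblock]


# Synthetic Macroblock whose pixel value encodes its position (16 * y + x).
mb = [[16 * y + x for x in range(16)] for y in range(16)]
subblocks = split_into_subblocks(mb)

# Lane k works on subblocks[k]; the lanes do not depend on each other.
residuals = [lane_residual(sb, 128) for sb in subblocks]
```

Because every lane receives exactly one subblock, all 16 lanes are busy in a single pass, which is the case the text contrasts with the multi-stage 4x4 kernel.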
The data processing granularity of the 4x4 Luma Intra-Prediction kernel is 4n
subblocks. These subblocks are processed in a 7-stage procedure using the 16 lanes, as
shown in Figure 2(b). All lanes read stream elements (subblocks) in sequence, so that
the data stream can be organized conveniently. Since the DLP degree of each stage in a
slice is less than 16 and all lanes have to compute in parallel, only specific lanes'
results are valid at each stage. The results produced by the last stage are
transferred to the next stage by
 