cache. Avoiding misses from the L2 cache receives higher priority because data evicted from L2 takes much longer to bring back than data evicted from L1. If no parent node is in danger of being evicted from either L1 or L2, we instead calculate the average priority of each node, with L1-resident parents weighted above L2-resident ones, so that memory access times are minimized. The other component, p_parallel, assigns priority based on the number of outgoing edges: the higher the number of child nodes, the greater the value of p_parallel, where k denotes a scaling factor. For source nodes, this count also includes the number of spatial neighbours that have already been processed. The total priority, p, is then computed as a weighted combination of the two priority factors. The node with the highest priority is processed by the core that executed the scheduler. If two processors start the priority computation at the same time and select the same MB, the tie is resolved arbitrarily in favour of either processor.
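A minimal sketch, in Python, of the priority computation described above; the weights W_CACHE and W_PARALLEL, the field names, and the exact scoring rules are illustrative assumptions, not the paper's precise formula:

    from dataclasses import dataclass, field

    @dataclass
    class MB:
        level: str = None          # 'L1', 'L2', or None (not cached) -- assumed model
        lam: int = 0               # lambda: eviction counter (lower = evicted sooner)
        parents: list = field(default_factory=list)
        children: list = field(default_factory=list)

    W_CACHE, W_PARALLEL = 0.7, 0.3   # assumed weights for the two factors
    K = 0.1                          # scaling factor k for p_parallel (assumed value)

    def p_cache(mb, l1_threshold=2, l2_threshold=4):
        """Favour nodes whose parents are about to be evicted; L2 evictions
        dominate because refetching from DRAM costs far more."""
        score = 0.0
        for p in mb.parents:
            if p.level == 'L2' and p.lam < l2_threshold:
                score += 2.0       # L2 eviction: the most expensive miss
            elif p.level == 'L1' and p.lam < l1_threshold:
                score += 1.0       # L1 eviction: cheaper, so lower priority
        if score == 0.0 and mb.parents:
            # No parent in danger: fall back to the average residency priority,
            # ranking L1-resident parents above L2-resident ones.
            residency = {'L1': 1.0, 'L2': 0.5, None: 0.0}
            score = sum(residency[p.level] for p in mb.parents) / len(mb.parents)
        return score

    def p_parallel(mb, k=K):
        """More child macroblocks unlocked means a higher parallelism payoff."""
        return k * len(mb.children)

    def priority(mb):
        return W_CACHE * p_cache(mb) + W_PARALLEL * p_parallel(mb)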
We now explain the working of our algorithm on Figure 2. We assume that L1 can accommodate 2 macroblocks and L2 can accommodate 8 macroblocks. Let us assume the macroblock labelled 15A depends on those marked 13B and 11A, while 15B depends on 11B and 13C. The first processor is idle and calls the scheduler. 11A is in L1 with a λ value of 1, 13B is in L2 with λ 5, 11B is in L2 with λ 4, and 13C is in L2 with a λ value of 8. The threshold values are 2 for L1 and 4 for L2. No MB in L2 is in danger of being evicted (comparing λ values against the threshold), and hence we look at nodes whose parents are in L1. Our algorithm then schedules 15A to be processed next, since its parent 11A, whose λ value is below the L1 threshold, is about to be flushed out of the L1 cache. If, however, 11A had a λ value of 2 and 13C had a λ value of 3, then the macroblock marked 15B would have been selected.
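Plugging the example's numbers into the sketch above reproduces both outcomes (the thresholds 2 and 4 come from the text; the weights and scoring remain the assumed ones):

    # Parents of 15A and 15B, with the lambda values from the example.
    mb_11A = MB(level='L1', lam=1)
    mb_13B = MB(level='L2', lam=5)
    mb_11B = MB(level='L2', lam=4)
    mb_13C = MB(level='L2', lam=8)
    mb_15A = MB(parents=[mb_11A, mb_13B])
    mb_15B = MB(parents=[mb_11B, mb_13C])

    # 11A (lam 1 < L1 threshold 2) is in danger of eviction, so 15A wins.
    assert priority(mb_15A) > priority(mb_15B)

    # The alternate scenario: 11A is no longer in danger, but 13C now is.
    mb_11A.lam, mb_13C.lam = 2, 3
    assert priority(mb_15B) > priority(mb_15A)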
5 Experimental Results
We implemented an architecture simulator to evaluate our proposed scheduling strategy in a multi-processor setting. Our implementation assumes a cache block size of 32 bytes, with fetches from DRAM in bursts of 256 bytes. We also assume that fetching one line of data requires 1 cycle for L1 and 8 cycles for L2. Our implementation uses efficient data structures for modelling the processors, the cache hierarchy and the main memory; in this way, we can efficiently obtain the cache entries that are about to be flushed.
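One way to support that query (a sketch of a possible data structure, not the authors' actual implementation) is an LRU model per cache level built on an ordered dictionary, so the entries closest to eviction can be read off the front in a single slice:

    from collections import OrderedDict

    class LRUCache:
        """LRU model of one cache level; keys are macroblock ids.
        Entries at the front of the ordered dict are flushed first."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()

        def access(self, mb_id):
            """Touch mb_id; return the evicted id on a capacity miss, else None."""
            if mb_id in self.entries:
                self.entries.move_to_end(mb_id)   # hit: refresh recency
                return None
            self.entries[mb_id] = True
            if len(self.entries) > self.capacity:
                victim, _ = self.entries.popitem(last=False)  # evict LRU entry
                return victim
            return None

        def about_to_flush(self, threshold):
            """Entries within `threshold` positions of eviction."""
            return list(self.entries)[:threshold]

    # Capacities from the earlier example: L1 holds 2 MBs, L2 holds 8.
    l1, l2 = LRUCache(2), LRUCache(8)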
We ran our method on a set of standard test videos: 11 video clips that are widely used across the digital video domain for benchmarking decoding, encoding and other pre/post-processing algorithms. All the clips are of HD (720p) resolution and contain 150 frames. Most of them have high detail and large variation within a frame; hence the intra-coded frames are expected to exhibit a wide distribution of intra-prediction modes (and hence of dependencies among macroblocks). Our goal was to generate an intra-only stream suite with good variation in the dependency relations among macroblocks.