cache. Avoiding misses from the L2 cache receives higher priority because data evicted from L2 takes much longer to bring back than data evicted from L1. If no parent node is in danger of being evicted from either L1 or L2, we instead calculate the average priority of each node, with L1-resident parents weighted above L2-resident ones, so that memory access times are minimized. The other component, p_parallel, assigns priority based on the number of outgoing edges: the higher the number of child nodes, the greater the value of p_parallel, where k denotes a scaling factor. For source nodes, this count also includes the number of spatial neighbours that have already been processed. The total priority, p, is then computed as a weighted combination of the two priority factors. The node with the highest priority is processed by the core that executed the scheduler. If two processors start the priority computation at the same time and select the same MB, the tie is resolved arbitrarily in favour of either processor.
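A minimal sketch, in Python, of the priority computation described above; the weights W_CACHE and W_PARALLEL, the field names, and the exact scoring rules are illustrative assumptions, not the paper's precise formula:

    from dataclasses import dataclass, field

    @dataclass
    class MB:
        level: str = None          # 'L1', 'L2', or None (not cached) -- assumed model
        lam: int = 0               # lambda: eviction counter (lower = evicted sooner)
        parents: list = field(default_factory=list)
        children: list = field(default_factory=list)

    W_CACHE, W_PARALLEL = 0.7, 0.3   # assumed weights for the two factors
    K = 0.1                          # scaling factor k for p_parallel (assumed value)

    def p_cache(mb, l1_threshold=2, l2_threshold=4):
        """Favour nodes whose parents are about to be evicted; L2 evictions
        dominate because refetching from DRAM costs far more."""
        score = 0.0
        for p in mb.parents:
            if p.level == 'L2' and p.lam < l2_threshold:
                score += 2.0       # L2 eviction: the most expensive miss
            elif p.level == 'L1' and p.lam < l1_threshold:
                score += 1.0       # L1 eviction: cheaper, so lower priority
        if score == 0.0 and mb.parents:
            # No parent in danger: fall back to the average residency priority,
            # ranking L1-resident parents above L2-resident ones.
            residency = {'L1': 1.0, 'L2': 0.5, None: 0.0}
            score = sum(residency[p.level] for p in mb.parents) / len(mb.parents)
        return score

    def p_parallel(mb, k=K):
        """More child macroblocks unlocked means a higher parallelism payoff."""
        return k * len(mb.children)

    def priority(mb):
        return W_CACHE * p_cache(mb) + W_PARALLEL * p_parallel(mb)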
We now explain the working of our algorithm on Figure 2. We assume that L1 can accommodate 2 macroblocks and L2 can accommodate 8 macroblocks. Let us assume the macroblock labelled 15A depends on those marked 13B and 11A, while 15B depends on 11B and 13C. The first processor is idle and calls the scheduler. 11A is in L1 with a λ value of 1, 13B is in L2 with λ 5, 11B is in L2 with λ 4, and 13C is in L2 with a λ value of 8. The threshold values are 2 for L1 and 4 for L2. No MB in L2 is in danger of being evicted (comparing λ values against the threshold), and hence we look at nodes whose parents are in L1. Our algorithm then schedules 15A to be processed next, since its parent 11A, whose λ value is below the L1 threshold, is about to be flushed out of the L1 cache. If, however, 11A had a λ value of 2 and 13C had a λ value of 3, then the macroblock marked 15B would have been selected.
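Plugging the example's numbers into the sketch above reproduces both outcomes (the thresholds 2 and 4 come from the text; the weights and scoring remain the assumed ones):

    # Parents of 15A and 15B, with the lambda values from the example.
    mb_11A = MB(level='L1', lam=1)
    mb_13B = MB(level='L2', lam=5)
    mb_11B = MB(level='L2', lam=4)
    mb_13C = MB(level='L2', lam=8)
    mb_15A = MB(parents=[mb_11A, mb_13B])
    mb_15B = MB(parents=[mb_11B, mb_13C])

    # 11A (lam 1 < L1 threshold 2) is in danger of eviction, so 15A wins.
    assert priority(mb_15A) > priority(mb_15B)

    # The alternate scenario: 11A is no longer in danger, but 13C now is.
    mb_11A.lam, mb_13C.lam = 2, 3
    assert priority(mb_15B) > priority(mb_15A)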
5 Experimental Results
We implemented an architecture simulator to evaluate our proposed scheduling strategy in a multi-processor setting. Our implementation assumes a cache block size of 32 bytes, with fetches from DRAM in bursts of 256 bytes. We also assume that fetching one line of data requires 1 cycle for L1 and 8 cycles for L2. Our implementation uses efficient data structures for modelling the processors, the cache hierarchy and the main memory; in this way, we can efficiently obtain the cache entries that are about to be flushed.
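One way to support that query (a sketch of a possible data structure, not the authors' actual implementation) is an LRU model per cache level built on an ordered dictionary, so the entries closest to eviction can be read off the front in a single slice:

    from collections import OrderedDict

    class LRUCache:
        """LRU model of one cache level; keys are macroblock ids.
        Entries at the front of the ordered dict are flushed first."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = OrderedDict()

        def access(self, mb_id):
            """Touch mb_id; return the evicted id on a capacity miss, else None."""
            if mb_id in self.entries:
                self.entries.move_to_end(mb_id)   # hit: refresh recency
                return None
            self.entries[mb_id] = True
            if len(self.entries) > self.capacity:
                victim, _ = self.entries.popitem(last=False)  # evict LRU entry
                return victim
            return None

        def about_to_flush(self, threshold):
            """Entries within `threshold` positions of eviction."""
            return list(self.entries)[:threshold]

    # Capacities from the earlier example: L1 holds 2 MBs, L2 holds 8.
    l1, l2 = LRUCache(2), LRUCache(8)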
We ran our method on a set of standard test videos: 11 video clips that are widely used across the digital video domain for benchmarking decoding, encoding and other pre/post-processing algorithms. All the clips are of HD (720p) resolution and contain 150 frames. Most of them have high detail and large variation within a frame; hence the intra-coded frames are expected to exhibit a wide distribution of intra-prediction modes (and hence of dependencies among macroblocks). Our goal was to generate an intra-only stream suite with good variation in the dependency relations among macroblocks.