We used the Joint Model reference software (JM) [4] version 17.2 for encoding
the contents in order to generate our test suite, consisting only of intra-coded
videos. The streams were then parsed with our in-house H.264 decoder and the
task-graphs were generated. For our experiments, we used the value 4 for both
cache thresholds and 0.5 for k. Simulations were done on a 2 GHz machine.
In order to obtain the number of cycles needed per macroblock, we note that
a 1080p30 video has 1920 × 1080 pixels = 120 × 67 = 8040 macroblocks and
a frame rate of 30 frames/s. Thus, one frame has to be decoded every 1/30 s; in
other words, one macroblock has to be decoded every 1/(30 × 8040) = 4.145 μs.
Decoding of a frame is slowest when all macroblocks are of the 4×4 type. Assuming
a 2 GHz machine with a 15% margin, the machine can execute 1.7 × 10^9 instructions
every second. Thus, the number of cycles required for one 4×4 macroblock =
1.7 × 10^9 × 4.145 × 10^−6 ≈ 7046.5. Since only 2/3 of the cycles are used to
process luma samples, the number of cycles required for luma samples ≈ 2/3 × 7000
≈ 4667 cycles. An 8×8 luma macroblock on average requires 2/3 × 4667 ≈ 3111
cycles, and one 16×16 luma macroblock requires 1/4 × 4667 ≈ 1167 cycles. We use
these values in our implementation to compute the speedup and execution time.
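The arithmetic above can be reproduced in a few lines of Python. The constants mirror the text; note that 4.145 μs is the rounded per-macroblock deadline, which is why the text's product (≈ 7046.5) differs slightly from the unrounded value:

```python
# Cycle-budget arithmetic from the text: a 2 GHz machine with a 15% margin
# leaves 1.7e9 usable cycles per second.
CLOCK_HZ = 2e9
USABLE_HZ = CLOCK_HZ * (1 - 0.15)              # 1.7e9 cycles/s

MBS_PER_FRAME = (1920 // 16) * (1080 // 16)    # 120 * 67 = 8040 macroblocks
FPS = 30
MB_DEADLINE_S = 1 / (FPS * MBS_PER_FRAME)      # ~4.145e-6 s per macroblock

cycles_4x4 = USABLE_HZ * MB_DEADLINE_S         # ~7046.5 cycles per 4x4 MB
luma_4x4 = 2 / 3 * 7000                        # ~4667 cycles (7046.5 rounded to 7000)
luma_8x8 = 2 / 3 * luma_4x4                    # ~3111 cycles
luma_16x16 = 1 / 4 * luma_4x4                  # ~1167 cycles
```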
We compared the speedup obtained by our method over the 3D-wavefront
approach, which is the most commonly used parallel approach currently
available. We implemented the 3D-wavefront method in a multi-core setting,
assuming that the macroblocks to be assigned to each processor have been kept
in separate queues before the actual decoding of macroblocks
begins [3]. While implementing Algorithm 1, we assumed that there is a single
shared queue (in L2) present for all ready macroblocks. The communication
time for fetching the required residual data is considered while computing the
speedup values. A processor that becomes idle executes the proposed heuristic
to select the next macroblock to be decoded. The execution of our heuristic was
found to take 40 cycles per macroblock present in the queue. Speedup values,
measured as the ratio of the number of processor cycles used by the 3D-wavefront
method to that used by our method, are shown in Table 1. Each row of the table (between rows
2 to 12) represents the speedup obtained for a particular video clip when 4, 8,
16, 32 and 64 processors are present. The final three rows show the minimum,
maximum and average speedup values. For most of the videos, our algorithm
offers improvements (speedup > 1) or comparable performance. For the videos on
which our method is slower than the wavefront method, the runtime overhead
(graph extraction, dependency-structure building, processing-time calculation,
scheduling) turns out to slow down the decoding process in comparison
to the simple static strategy. We note that the speedup is minimum for
32 processors. This can be explained by observing that the number of
macroblocks in the ready queue peaks at this point. Having fewer processors
reduces the number of macroblocks actually made available at each cycle,
whereas having more processors ensures that the macroblocks kept
waiting are very few in number.
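As an illustration of the shared-ready-queue scheme above, the following sketch simulates idle processors pulling macroblocks from a single queue. Only the 40-cycle-per-queued-macroblock heuristic cost is taken from the text; the selection rule (costliest ready macroblock first) and the simplified dependency pattern (each macroblock waits on its left and top neighbours) are hypothetical stand-ins for the paper's actual heuristic and the full H.264 intra dependencies:

```python
import heapq

HEURISTIC_COST = 40  # cycles spent per macroblock present in the ready queue

def simulate(rows, cols, cost, procs):
    """Makespan (cycles) of decoding a rows x cols macroblock grid on
    `procs` processors; cost[(r, c)] gives each macroblock's cycle count."""
    # Simplified dependency pattern: each MB waits on its top and left MBs.
    indeg = {(r, c): (r > 0) + (c > 0)
             for r in range(rows) for c in range(cols)}
    ready = [(0, (0, 0))]        # (time at which the MB became ready, MB)
    proc_free = [0] * procs      # min-heap of processor idle times
    heapq.heapify(proc_free)
    running = []                 # min-heap of (finish_time, MB)
    makespan = 0
    while True:
        # Retire completions that precede the earliest idle processor
        # (or, when nothing is ready, wait for the next completion).
        while running and (running[0][0] <= proc_free[0] or not ready):
            finish, (r, c) = heapq.heappop(running)
            for nb in ((r + 1, c), (r, c + 1)):
                if nb in indeg:
                    indeg[nb] -= 1
                    if indeg[nb] == 0:
                        ready.append((finish, nb))
        if not ready:
            break                # nothing ready, nothing running: done
        t = heapq.heappop(proc_free)
        t += HEURISTIC_COST * len(ready)   # scan the whole shared queue
        # Hypothetical selection rule: decode the costliest ready MB first.
        i = max(range(len(ready)), key=lambda k: cost[ready[k][1]])
        avail, mb = ready.pop(i)
        finish = max(t, avail) + cost[mb]
        heapq.heappush(running, (finish, mb))
        heapq.heappush(proc_free, finish)
        makespan = max(makespan, finish)
    return makespan
```

With per-macroblock costs such as the 1167/3111/4667-cycle figures derived earlier, this kind of simulation exposes the trade-off discussed above: too few processors starve on available macroblocks, while many processors keep the ready queue (and hence the per-selection heuristic overhead) small.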
 