We used the Joint Model reference software (JM) [4] version 17.2 for encoding
the contents in order to generate our test suite, consisting only of intra-coded
videos. The streams were then parsed with our in-house H.264 decoder and the
task-graphs were generated. For our experiments, we used the value 4 for both
cache thresholds and 0.5 for k. Simulations were done on a 2 GHz machine.
In order to obtain the number of cycles needed per macroblock, we note that
a 1080p30 video has 1920 × 1080 pixels = 120 × 67 = 8040 macroblocks and
a frame rate of 30 frames/s. Thus, one frame has to be decoded every 1/30 s; in
other words, one macroblock has to be decoded every 1/(30 × 8040) = 4.145 μs.
Decoding of a frame is slowest when all macroblocks are of the 4×4 type. Assuming
a 2 GHz machine with a 15% margin, the machine can execute 1.7 × 10^9 instructions
every second. Thus, the number of cycles required for one 4×4 macroblock =
1.7 × 10^9 × 4.145 × 10^−6 ≈ 7046.5. Since only 2/3 of the cycles are used to
process luma samples, the number of cycles required for luma samples ≈ 2/3 × 7000
≈ 4667 cycles. An 8×8 luma macroblock on average requires 2/3 × 4667 ≈ 3111
cycles, and one 16×16 luma macroblock requires 1/4 × 4667 ≈ 1167 cycles. We use
these values in our implementation to compute the speedup and execution time.
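The arithmetic above can be reproduced in a few lines of Python. The constants mirror the text; note that 4.145 μs is the rounded per-macroblock deadline, which is why the text's product (≈ 7046.5) differs slightly from the unrounded value:

```python
# Cycle-budget arithmetic from the text: a 2 GHz machine with a 15% margin
# leaves 1.7e9 usable cycles per second.
CLOCK_HZ = 2e9
USABLE_HZ = CLOCK_HZ * (1 - 0.15)              # 1.7e9 cycles/s

MBS_PER_FRAME = (1920 // 16) * (1080 // 16)    # 120 * 67 = 8040 macroblocks
FPS = 30
MB_DEADLINE_S = 1 / (FPS * MBS_PER_FRAME)      # ~4.145e-6 s per macroblock

cycles_4x4 = USABLE_HZ * MB_DEADLINE_S         # ~7046.5 cycles per 4x4 MB
luma_4x4 = 2 / 3 * 7000                        # ~4667 cycles (7046.5 rounded to 7000)
luma_8x8 = 2 / 3 * luma_4x4                    # ~3111 cycles
luma_16x16 = 1 / 4 * luma_4x4                  # ~1167 cycles
```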
We compared the speedup obtained by our method over the 3D-wavefront
approach, which is the most commonly used parallel approach currently
available. We implemented the 3D-wavefront method in a multi-core setting,
assuming that the macroblocks to be assigned to each processor have been kept
in separate queues before the actual decoding of macroblocks
begins [3]. While implementing Algorithm 1, we assumed that there is a single
shared queue (in L2) present for all ready macroblocks. The communication
time for fetching the required residual data is considered while computing the
speedup values. A processor that becomes idle executes the proposed heuristic
to select the next macroblock to be decoded. The execution of our heuristic was
found to take 40 cycles per macroblock present in the queue. Speedup values,
measured as the ratio of the number of processor cycles used by the 3D-wavefront
method to that used by our method, are shown in Table 1. Each row of the table (between rows
2 to 12) represents the speedup obtained for a particular video clip when 4, 8,
16, 32 and 64 processors are present. The final three rows show the minimum,
maximum and average speedup values. For most of the videos, our algorithm
offers improvements (speedup > 1) or comparable performance. For the videos on
which our method is slower than the wavefront method, the runtime overhead
(graph extraction, dependency-structure building, processing-time calculation,
scheduling) turns out to slow down the decoding process in comparison
to the simple static strategy. We note that the speedup is minimum for
32 processors. This can be explained by observing that the number of
macroblocks in the ready queue peaks at this point. Having fewer processors
reduces the number of macroblocks actually made available at each cycle,
whereas having more processors ensures that the macroblocks kept
waiting are very few in number.
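As an illustration of the shared-ready-queue scheme above, the following sketch simulates idle processors pulling macroblocks from a single queue. Only the 40-cycle-per-queued-macroblock heuristic cost is taken from the text; the selection rule (costliest ready macroblock first) and the simplified dependency pattern (each macroblock waits on its left and top neighbours) are hypothetical stand-ins for the paper's actual heuristic and the full H.264 intra dependencies:

```python
import heapq

HEURISTIC_COST = 40  # cycles spent per macroblock present in the ready queue

def simulate(rows, cols, cost, procs):
    """Makespan (cycles) of decoding a rows x cols macroblock grid on
    `procs` processors; cost[(r, c)] gives each macroblock's cycle count."""
    # Simplified dependency pattern: each MB waits on its top and left MBs.
    indeg = {(r, c): (r > 0) + (c > 0)
             for r in range(rows) for c in range(cols)}
    ready = [(0, (0, 0))]        # (time at which the MB became ready, MB)
    proc_free = [0] * procs      # min-heap of processor idle times
    heapq.heapify(proc_free)
    running = []                 # min-heap of (finish_time, MB)
    makespan = 0
    while True:
        # Retire completions that precede the earliest idle processor
        # (or, when nothing is ready, wait for the next completion).
        while running and (running[0][0] <= proc_free[0] or not ready):
            finish, (r, c) = heapq.heappop(running)
            for nb in ((r + 1, c), (r, c + 1)):
                if nb in indeg:
                    indeg[nb] -= 1
                    if indeg[nb] == 0:
                        ready.append((finish, nb))
        if not ready:
            break                # nothing ready, nothing running: done
        t = heapq.heappop(proc_free)
        t += HEURISTIC_COST * len(ready)   # scan the whole shared queue
        # Hypothetical selection rule: decode the costliest ready MB first.
        i = max(range(len(ready)), key=lambda k: cost[ready[k][1]])
        avail, mb = ready.pop(i)
        finish = max(t, avail) + cost[mb]
        heapq.heappush(running, (finish, mb))
        heapq.heappush(proc_free, finish)
        makespan = max(makespan, finish)
    return makespan
```

With per-macroblock costs such as the 1167/3111/4667-cycle figures derived earlier, this kind of simulation exposes the trade-off discussed above: too few processors starve on available macroblocks, while many processors keep the ready queue (and hence the per-selection heuristic overhead) small.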
 