video structure, with the aim of achieving better decode performance through load balancing and distribution of the workload among the available processors, while honouring video dependency constraints as applicable.
For intra-coded videos, strategies exploiting macro-block level parallelism have generally been found to be more successful. The problem in this setting is essentially to identify the macro-block dependency structure inside an H.264 slice/frame, in order to process the macro-blocks in parallel (honouring dependencies as applicable) on the available processors in a multi-core setting, with the objective of minimizing the end-to-end decode time. Both static and dynamic macro-block scheduling strategies have been proposed. Static scheduling strategies generally assume worst-case dependency patterns among the constituent macro-blocks and, often, equal processing times irrespective of macro-block type.
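To make the dependency structure concrete, the following sketch (our illustration, not taken from the paper) builds the worst-case intra dependency pattern of a W×H macro-block grid, assuming each macro-block may depend on its left, top-left, top, and top-right neighbours, as in H.264 intra prediction, and derives the earliest "wavefront" step at which each macro-block can be decoded in parallel:

```python
# Sketch: worst-case intra macro-block dependencies on a w x h MB grid,
# assuming each MB depends on its left, top-left, top and top-right
# neighbours (the H.264 intra-prediction pattern). From this we derive
# the earliest parallel step at which each MB can be decoded.

def intra_deps(x, y, w, h):
    """Neighbours that MB (x, y) may depend on for intra prediction."""
    cands = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
    return [(i, j) for (i, j) in cands if 0 <= i < w and 0 <= j < h]

def wavefront_steps(w, h):
    """Earliest step for each MB: one step after its latest dependency."""
    step = {}
    for y in range(h):          # row-major order: all deps lie in
        for x in range(w):      # earlier rows or to the left
            deps = intra_deps(x, y, w, h)
            step[(x, y)] = 1 + max((step[d] for d in deps), default=-1)
    return step

steps = wavefront_steps(4, 3)
# MB (0, 0) has no dependencies and decodes first; MBs that share a
# step value form one wavefront and can be processed in parallel.
```

With this dependency pattern the wavefront is the familiar "knight's move" diagonal: MB (x, y) becomes ready at step x + 2y.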
This paper rests on two key observations. First, a static scheduling approach that assumes uniform macro-block processing times leads to poor processor utilization. In reality, macro-block processing times vary with the inputs and the dependency structure. Hence, effective processor utilization can be improved by a dynamic scheduling approach that assigns macro-blocks to free processors as soon as they become ready, as opposed to a static solution that schedules at pre-defined intervals. Second, beyond improving utilization, we show that the effective speed-up obtained crucially depends on how the decode strategy interacts with the cache in a multi-processor setting with a hierarchical (private L1, shared L2, DRAM) memory structure. Many existing decode strategies do not consider the cache misses resulting from cache-oblivious selection of the macro-blocks to be processed, which leads to significant slowdown in decoder performance due to frequent accesses to the lower, slower memory levels.
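The dynamic approach described above can be sketched as a small list-scheduling simulation (our illustration, not the paper's algorithm): whenever a processor frees up, it grabs any macro-block whose dependencies are complete, so non-uniform processing times do not leave processors idle at fixed interval boundaries the way a static schedule would. Here `deps` maps each macro-block to its prerequisites and `cost` to its (possibly non-uniform) processing time:

```python
import heapq

# Sketch of dynamic macro-block scheduling on num_procs processors.
# deps: mb -> set of prerequisite mbs (every mb must appear as a key);
# cost: mb -> processing time. Returns the makespan (decode time).

def dynamic_schedule(deps, cost, num_procs):
    remaining = {mb: len(d) for mb, d in deps.items()}
    dependents = {mb: [] for mb in deps}
    for mb, d in deps.items():
        for p in d:
            dependents[p].append(mb)
    ready = [mb for mb, n in remaining.items() if n == 0]
    running = []                      # min-heap of (finish_time, mb)
    free, now, done = num_procs, 0.0, 0
    while done < len(deps):
        while ready and free > 0:     # start every ready MB we can
            mb = ready.pop()
            free -= 1
            heapq.heappush(running, (now + cost[mb], mb))
        now, mb = heapq.heappop(running)   # advance to next completion
        free += 1
        done += 1
        for succ in dependents[mb]:        # release dependent MBs
            remaining[succ] -= 1
            if remaining[succ] == 0:
                ready.append(succ)
    return now

deps = {'a': set(), 'b': set(), 'c': {'a', 'b'}}
cost = {'a': 2.0, 'b': 1.0, 'c': 1.0}
# two processors: 'a' and 'b' run in parallel, 'c' starts when 'a' ends
makespan = dynamic_schedule(deps, cost, 2)   # -> 3.0
```

A static schedule that assumed equal processing times would dispatch at fixed boundaries and waste the slack left by the shorter macro-block 'b'; the dynamic scheduler reclaims it automatically.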
Our work makes two proposals for harnessing the power of parallel computation in a multi-core setting. On one hand, we propose a cache-aware [5] scheduling strategy that minimizes the number of cache misses by carefully selecting the macro-blocks to be considered next, taking into account the chance that a macro-block they depend on gets evicted from the cache through a capacity or conflict miss. On the other hand, we attempt to increase the number of macro-blocks available for processing at every time point, which in turn implies better processor utilization and hence improved speedup.
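The selection idea can be illustrated by a minimal heuristic (our sketch, not the paper's exact policy): among the macro-blocks that are ready, prefer the one with the most dependencies still resident in cache, modelled here as a plain set of recently produced macro-blocks rather than a full LRU simulation:

```python
# Sketch of cache-aware selection: score each ready MB by how many of
# its prerequisite MBs are still cached, so reference data is reused
# before capacity or conflict misses force a refetch from slower memory.

def pick_next(ready, deps, cached):
    """Pick the ready MB whose prerequisite data is most likely cached."""
    return max(ready, key=lambda mb: sum(d in cached for d in deps[mb]))

# Example: MB 'c' needs 'a', MB 'd' needs 'b'; only 'a' is cached,
# so decoding 'c' next avoids a trip to the lower memory levels.
choice = pick_next(['c', 'd'], {'c': {'a'}, 'd': {'b'}}, {'a'})   # -> 'c'
```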
We implemented our scheduling heuristic and evaluated it on a number of standard benchmarks. Experiments show significant speedup compared to existing methods.
2 Background and Related Work
An H.264 video [1, 6] consists of a sequence of frames. A frame is an array of luma samples and two corresponding arrays of chroma samples. Each frame is further divided into spatial units called slices. A slice consists of blocks of 16×16 pixels, known as macro-blocks (MB). A macro-block contains type information describing the choice of methods used to code the macro-block and prediction