A Cache-Aware Strategy for H.264 Decoding
on Multi-processor Architectures
Arani Bhattacharya 1, Ansuman Banerjee 1, Susmita Sur-Kolay 1,
Prasenjit Basu, and Bhaskar J. Karmakar 2,∗
1 Indian Statistical Institute
{arani89,prasenjit.basu}@gmail.com,
{ansuman,ssk}@isical.ac.in
2 S3Craft Technologies
bhaskar@s3craft.com
Abstract. H.264-AVC is one of the most popular formats for the recording,
compression and distribution of video. Encoders and decoders for the H.264
standard are widely in demand, and efficient strategies for enhancing their
performance have been areas of active research. With the proliferation of
many-core architectures in the embedded community, there has been a trend
towards parallelizing implementations of encoders and decoders. In this paper,
we present a run-time heuristic which exploits macro-block level parallelism
and efficient scheduling inside an H.264 decoder to reduce the number of cache
misses and improve processor utilization. Experiments on standard benchmarks
show a significant speed-up over contemporary strategies proposed in the
literature.
1 Introduction
H.264/MPEG-4 Part 10 or AVC (Advanced Video Coding) is one of the most
common video formats in recent times. H.264 provides much better compression
ratios than most other video formats such as H.263 and MPEG-2. Encoders and
decoders for the H.264 standard are widely in demand, and efficient strategies
for enhancing their performance have been areas of active research.
Security applications typically involve widespread deployment of H.264. In
the security context, videos are mostly intra-coded, i.e., all existing motion
dependencies lie within the same frame. Intra-coded videos have therefore been
a subject of active research in both academic and industrial settings.
With the proliferation of many-core architectures in the embedded community,
there has been a trend towards parallelizing implementations of encoders and
decoders. In general, these proposals have focused on efficient exploitation of
the inherent parallelism (at the frame level, slice level or macro-block level;
a minimal sketch of the macro-block level case is given after the footnote
below) in the
∗ This work was started while Bhaskar J. Karmakar and Prasenjit Basu were at
Texas Instruments, India. The authors would like to acknowledge the financial
assistance received from Texas Instruments, and thank Dr. Mahesh Mehendale,
Fellow, Texas Instruments, for his continuous encouragement and support.
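To make the macro-block level parallelism mentioned above concrete, the following is a minimal C sketch of the well-known 2D-wavefront dependency pattern for the macro-blocks (MBs) of an intra-coded frame: an MB at position (x, y) depends on its left, top-left, top and top-right neighbours, so all MBs sharing the same value of x + 2y are mutually independent and can be decoded in parallel. The frame dimensions and the decode_mb stub are illustrative assumptions for this sketch and are not taken from the paper.

/* gcc -std=c99 -o wavefront wavefront.c && ./wavefront */
#include <stdio.h>

/* Frame size in macro-blocks; 1280x720 pixels -> 80x45 MBs of 16x16
 * pixels.  These numbers are illustrative, not taken from the paper. */
#define MB_COLS 80
#define MB_ROWS 45

/* Hypothetical per-MB decode kernel (intra prediction, IDCT, ...). */
static void decode_mb(int x, int y) { (void)x; (void)y; }

int main(void)
{
    /* Intra prediction makes MB(x, y) depend on its left, top-left,
     * top and top-right neighbours.  All MBs with the same value of
     * x + 2*y are therefore mutually independent; the loop below
     * walks these "waves" in dependency-respecting order.           */
    int last_wave = (MB_COLS - 1) + 2 * (MB_ROWS - 1);

    for (int wave = 0; wave <= last_wave; wave++) {
        int ready = 0;
        for (int y = 0; y < MB_ROWS; y++) {
            int x = wave - 2 * y;
            if (x < 0 || x >= MB_COLS)
                continue;
            decode_mb(x, y);   /* in a real decoder, dispatched to a core */
            ready++;
        }
        /* 'ready' is the MB-level parallelism exposed at this wave;
         * it peaks at roughly min(MB_COLS / 2, MB_ROWS).             */
        printf("wave %3d: %2d macro-blocks decodable in parallel\n",
               wave, ready);
    }
    return 0;
}

In a multi-processor decoder, the ready macro-blocks of each wave would be dispatched across cores; the open question is then in what order and to which core each ready MB should go, which is where cache-aware scheduling of the kind studied in this paper comes in.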