Decoder Hardware Architecture for HEVC - High Efficiency Video Coding (HEVC) - page 313


Fig. 10.7 Eliminate read/write with registers for an SRAM-based transpose memory. (a) Pipeline stall due to transpose SRAM delay for 32×32 TU. (b) Row caching to avoid stall

processed, the transform module must wait for it to be written to the SRAM before it can begin processing the row. This results in a delay of nine cycles for a 32×32 TU. In general, for an N×N TU, this delay is equal to N/4 + 1 cycles. This results in a pipeline stall of 1.75 % to 25 % of cycles, depending on the TU size. This stall can be avoided through the use of a row cache that stores the first N + 4 pixels in registers. As shown in Fig. 10.7b, the row cache is read for the first nine cycles of the row transforms while the last column is being stored in the SRAM.
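The delay and row-cache figures above can be checked with a few lines of arithmetic. This is a minimal sketch, not the authors' model: the function name `transpose_stall` is hypothetical, and the stall percentage rests on an assumption the text does not state, namely that the stall is measured against a column pass plus a row pass of N×N/4 cycles each at 4 pixels/cycle.

```python
PIXELS_PER_CYCLE = 4  # throughput implied by the N/4 + 1 rule in the text


def transpose_stall(n):
    """Return (delay cycles, row-cache pixels, stall percent) for an n x n TU."""
    delay_cycles = n // PIXELS_PER_CYCLE + 1          # N/4 + 1 cycles
    cache_pixels = delay_cycles * PIXELS_PER_CYCLE    # N + 4 pixels kept in registers
    # Assumption (not from the text): the TU occupies the transform for a
    # column pass and a row pass, each taking n*n / PIXELS_PER_CYCLE cycles.
    transform_cycles = 2 * n * n // PIXELS_PER_CYCLE
    stall_percent = 100.0 * delay_cycles / transform_cycles
    return delay_cycles, cache_pixels, stall_percent


for n in (4, 8, 16, 32):
    delay, cache, pct = transpose_stall(n)
    print(f"{n}x{n} TU: delay={delay} cycles, "
          f"row cache={cache} pixels, stall={pct:.2f} %")
```

Under that assumption the 4×4 case gives 2/8 = 25 % and the 32×32 case gives 9/512, roughly 1.76 %, which is in line with the range quoted in the text.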

This transpose memory design using SRAM scales very well for lower throughputs. A 2-pixel/cycle transpose memory would need two banks each with 512 entries (16-bit/entry). For higher throughputs, one needs more banks each with fewer entries. Such short SRAM banks have a larger area overhead of sense-amplifiers and other read-out circuitry. For throughputs higher than 32-pixel/cycle, register-based transpose memory [23] is more area-efficient.
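The trade-off between bank count and bank depth can be illustrated with a small calculation. The sketch below assumes one SRAM bank per pixel of throughput and a total capacity of one 32×32 TU (1024 entries of 16 bits); this matches the 2-pixel/cycle example in the text but is otherwise an extrapolation, and `bank_layout` is a hypothetical helper name.

```python
TU_SIZE = 32      # largest TU edge, so the memory holds 32 * 32 entries
ENTRY_BITS = 16   # bits per entry, as quoted in the text


def bank_layout(pixels_per_cycle):
    """Return (banks, entries per bank) for a given throughput.

    Assumption: one bank per pixel/cycle, capacity fixed at one 32x32 TU.
    """
    total_entries = TU_SIZE * TU_SIZE
    if pixels_per_cycle <= 0 or total_entries % pixels_per_cycle:
        raise ValueError("throughput must divide the total entry count")
    banks = pixels_per_cycle
    return banks, total_entries // banks


for throughput in (2, 4, 8, 16, 32):
    banks, depth = bank_layout(throughput)
    print(f"{throughput:2d} pixel/cycle: {banks:2d} banks x {depth:3d} entries "
          f"({ENTRY_BITS}-bit/entry)")
```

At 32 pixels/cycle this layout shrinks to 32 banks of only 32 entries each, which is the regime where the text notes that sense-amplifier and read-out overhead makes short SRAM banks unattractive.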

10.4.3 Inverse DCT Engine

The IDCT engine can be optimized by observing that the N-pt IDCT matrix has at most N unique coefficients differing only in sign. This is also true of the matrices obtained by even-odd decomposition of the IDCT matrix, such as the 16×16 matrix of the 32-pt IDCT. This 256-element matrix contains 15 unique numbers: 90, 88, 85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4 (and their additive inverses). The matrix is multiplied with the odd-indexed coefficients in the 32-pt IDCT. In a 4-pixel/cycle
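The count of 15 unique magnitudes in the 16×16 odd matrix can be reproduced with a short sketch. It assumes that the integer matrix keeps the symmetries of the underlying cosine matrix, so each entry is a signed lookup for the angle (2k+1)(2n+1)π/64, and that the value 90 serves the two smallest angles, which would explain why 16 angles yield only 15 distinct numbers. The names `MAGNITUDES`, `coeff` and `odd_matrix` are hypothetical.

```python
# Magnitudes for the angles (2i + 1) * pi / 64, i = 0..15. The values are the
# ones listed in the text; 90 is assumed to appear for both i = 0 and i = 1.
MAGNITUDES = [90, 90, 88, 85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4]


def coeff(m):
    """Integer stand-in for cos(m * pi / 64), valid for odd m."""
    m %= 128                      # cosine has period 2*pi, i.e. 128 steps
    if m > 64:
        m = 128 - m               # cos(x) = cos(2*pi - x)
    if m > 32:
        return -MAGNITUDES[(64 - m - 1) // 2]   # cos(x) = -cos(pi - x)
    return MAGNITUDES[(m - 1) // 2]


def odd_matrix():
    """Row k belongs to odd-indexed coefficient 2k + 1, column n to output n."""
    return [[coeff((2 * k + 1) * (2 * n + 1)) for n in range(16)]
            for k in range(16)]


matrix = odd_matrix()
unique = sorted({abs(v) for row in matrix for v in row}, reverse=True)
print(len(unique), unique)
```

Because every entry is one of a small set of constants, a hardware multiplier array for this matrix only ever needs products with those 15 values, which is the observation the section builds on.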
