Fig. 11.2 HEVC inter prediction architecture (block diagram: IME and FME control, reference and current luma SRAMs, interpolation units, 2D SAD trees, Hadamard-transform cost computation, AMVP generation, and PU mode pre-decision)
Fig. 11.3 2D SAD adder tree architecture (systolic array of 4096 subtract-and-absolute PEs over the 64×64 current and reference LCU buffers, 64 2-D adder trees for 8×8 PU blocks, and one merge/split tree for larger blocks feeding the decision unit and SAD buffer)
In this work, the IME architecture uses a parallel-PU design based on a 2D SAD adder tree to meet the high throughput requirement [10], as illustrated in Fig. 11.3. In parallel-PU IME, all PU blocks inside the current CTU are processed in parallel. Instead of looping over all the pixels inside a PU block, the cost of a larger PU can be derived directly from the costs of its sub-divided PUs (i.e., a bottom-up approach). For example, the SAD cost of a 16×8 PU is obtained simply by adding the costs of its two co-located 8×8 PUs. By utilizing the 2D SAD adder tree, we can retrieve all PU costs for a given motion vector at once. However, there are dependencies among PUs: for instance, the motion vectors of neighboring PUs are used to derive the motion vector predictor that enters the motion vector cost calculation. To enable parallelism, we estimate the motion vector predictors from the nearest available vectors just outside the current CTU. For the IME search algorithm, we need to select one of the two advanced motion vector predictors
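The bottom-up cost aggregation described above can be sketched in software. The snippet below is a minimal illustrative model, not the hardware's actual datapath: it computes the base 8×8 SADs for one candidate motion vector and then derives the 16×8, 16×16, 32×32, and 64×64 PU costs purely by summing co-located lower-level SADs, mirroring the 2D SAD adder tree. All function and variable names are assumptions for illustration.

```python
import numpy as np

def bottom_up_sad(cur_ctu, ref_block, base=8):
    """Illustrative model of the 2D SAD adder tree's bottom-up scheme.

    cur_ctu, ref_block: 64x64 luma arrays (current CTU and the reference
    block at one candidate motion vector). Returns the SAD grids for
    8x8, 16x16, 32x32 PUs and the 64x64 CTU cost.
    """
    n = cur_ctu.shape[0] // base  # 8 base blocks per side for a 64x64 CTU
    # PE array stage: per-pixel subtract + absolute
    diff = np.abs(cur_ctu.astype(np.int32) - ref_block.astype(np.int32))
    # Base level: sum |difference| inside each 8x8 block -> (8, 8) grid
    sad8 = diff.reshape(n, base, n, base).sum(axis=(1, 3))
    # A 16x8 PU cost = sum of its two horizontally co-located 8x8 SADs
    sad16x8 = sad8[:, 0::2] + sad8[:, 1::2]            # (8, 4) grid
    # A 16x16 PU cost = sum of two vertically co-located 16x8 costs
    sad16 = sad16x8[0::2, :] + sad16x8[1::2, :]        # (4, 4) grid
    # A 32x32 PU cost = sum of its four co-located 16x16 costs
    sad32 = (sad16[0::2, 0::2] + sad16[0::2, 1::2]
             + sad16[1::2, 0::2] + sad16[1::2, 1::2])  # (2, 2) grid
    sad64 = sad32.sum()                                 # whole-CTU cost
    return sad8, sad16, sad32, sad64
```

No pixel is revisited above the base level: each larger PU cost is a handful of additions over already-computed sub-PU costs, which is what lets the hardware emit every PU cost for a candidate motion vector in one pass.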
 