Digital Signal Processing Reference
Table 3  Estimated results for each 2 MByte 2-bank frame storage DRAM block

                                          1 DRAM layer                  4 DRAM layers
Number of sub-banks                  1     2     4     8    16      1     2     4     8    16
Access time (non-burst) (ns)      9.49  7.37  6.65  6.57  7.36   9.36  7.25  6.53  6.44  6.67
Burst access time (ns)            7.17  4.83  4.11  3.91  4.39   7.09  4.84  4.03  3.83  3.93
Energy per access (non-burst) (nJ) 0.71 1.00  1.29  1.61  1.99   0.93  1.22  1.28  1.59  1.00
Energy per burst access (nJ)      0.15  0.45  0.76  1.08  1.46   0.13  0.43  0.74  1.06  1.42
Footprint (mm²)                   1.10  2.08  3.12  4.79  7.49   0.22  0.34  0.47  0.69  1.13
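To make the burst vs. non-burst trade-off in Table 3 concrete, the following minimal sketch compares the two access modes using the 4-layer, 8-sub-bank column (the values are copied from the table; the comparison itself is illustrative, not part of the chapter's methodology):

```python
# Burst vs. non-burst comparison, 4-layer / 8-sub-bank column of Table 3.
non_burst_energy_nj = 1.59   # energy per non-burst access (nJ)
burst_energy_nj = 1.06       # energy per burst access (nJ)
non_burst_time_ns = 6.44     # non-burst access time (ns)
burst_time_ns = 3.83         # burst access time (ns)

energy_saving = 1.0 - burst_energy_nj / non_burst_energy_nj
time_saving = 1.0 - burst_time_ns / non_burst_time_ns

print(f"burst access saves {energy_saving:.0%} energy "
      f"and {time_saving:.0%} latency per access")
```

For this design point, burst access cuts per-access energy by about a third and latency by about 40%, which is why the memory mapping tries to keep accesses on the same word-line.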
of 128 bits. We assume that the encoder must be able to support multi-frame motion
estimation with up to five reference frames. Hence, we need six 2-bank frame 3D
DRAM blocks to store the current frame and the five reference frames in the stacked
3D DRAM. This leads to an aggregate data I/O bandwidth of 128 × 6 = 768
bits, corresponding to 768 TSVs for the logic-DRAM data interconnect. We use the
inter-sub-array 3D partitioning strategy presented in Sect. 4 to estimate 3D DRAM
performance. For the target HDTV1080p resolution, each image frame needs about
2MByte storage. Table 3 shows the estimated 3D DRAM results for each 2 MByte
2-bank frame storage DRAM block at 65-nm node. Since each sub-bank always has
eight sub-arrays, we explore the 3D DRAM design space by varying the size of each
sub-array and the number of sub-banks. In this study, the number of bit-lines in each
sub-array is fixed at 512.
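The per-frame storage and TSV figures above can be sanity-checked with a short calculation (a sketch; it assumes the "about 2 MByte" per-frame estimate is dominated by an 8-bit luma plane, which the chapter does not state explicitly):

```python
# Per-frame storage for HDTV 1080p, assuming 8-bit samples:
# the 1920x1080 luma plane alone is close to the stated "about 2 MByte".
width, height = 1920, 1080
luma_bytes = width * height                       # 2,073,600 bytes
print(f"luma plane: {luma_bytes / 2**20:.2f} MiB")  # ~1.98 MiB

# Aggregate data I/O: six 2-bank frame blocks, 128 data bits each.
blocks, io_bits_per_block = 6, 128
total_tsvs = blocks * io_bits_per_block
print(f"data TSVs: {total_tsvs}")                 # 768
```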
Table 3 clearly shows a trade-off: as we increase the number of sub-banks by
reducing the size of each sub-array, we can directly reduce the access latency,
but the access energy consumption and DRAM footprint increase. We
considered both 1-layer and 4-layer 3D DRAM stacking, and the results clearly
show the advantages of 4-layer 3D DRAM stacking. As pointed out above,
the proposed 3D DRAM attempts to access data on the same word-line as much
as possible; such access is conventionally denoted burst access. Table 3 shows
the difference between burst and non-burst access, and burst access is clearly
preferable.
The image storage architecture presented above can seamlessly support any
arbitrary motion vector search pattern, and hence naturally supports various motion
estimation algorithms. In this case study we considered the following popular
algorithms: exhaustive full search (FS), three step search (TSS) [11], new three
step search (NTSS) [48], four step search (FSS) [60, 73], and diamond search
(DS) [70, 71]. We apply these algorithms to two widely used HDTV 1080p video
sequences, Tractor and Rush hour [72], where 15 frames are extracted and analyzed
in each video sequence. Figure 12 shows the peak signal-to-noise ratio (PSNR)
vs. average memory energy consumption for processing each image frame without
using an on-chip SRAM buffer. Each curve contains five points, corresponding to the
scenarios using 1, 2, 3, 4, and 5 reference frames, respectively. We note that, due to
the very regular memory access pattern of full search, explicit memory accesses can
be greatly reduced by data reuse in the motion estimation engine. In this study, we
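Of the search algorithms listed above, three step search (TSS) gives a feel for the data-dependent access patterns the frame store must serve: each step's candidate locations depend on the previous step's best match. The sketch below is a generic TSS skeleton, not the chapter's implementation; the `cost` callable stands in for a block-matching metric such as SAD against the current macroblock:

```python
def three_step_search(cost, center, step=4):
    """Classic three step search: evaluate the centre and its eight
    neighbours at the current step size, re-centre on the best match,
    then halve the step until it reaches 1.  `cost` is any block-matching
    metric (placeholder here); step=4 covers a +/-7 search window."""
    cx, cy = center
    while step >= 1:
        candidates = [(cx + dx * step, cy + dy * step)
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        cx, cy = min(candidates, key=cost)
        step //= 2
    return cx, cy

# Toy quadratic cost surface with its minimum at (3, -2).
best = three_step_search(lambda p: (p[0] - 3) ** 2 + (p[1] + 2) ** 2, (0, 0))
print(best)  # (3, -2)
```

Unlike full search, the nine candidate blocks per step land at data-dependent addresses, which is exactly the irregular access pattern the 3D DRAM frame store is designed to absorb without an SRAM buffer.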