Digital Signal Processing Reference
Table 3  Estimated results for each 2 MByte 2-bank frame storage DRAM block

                                          1 DRAM layer                  4 DRAM layers
Number of sub-banks                  1     2     4     8    16      1     2     4     8    16
Access time (non-burst) (ns)      9.49  7.37  6.65  6.57  7.36   9.36  7.25  6.53  6.44  6.67
Burst access time (ns)            7.17  4.83  4.11  3.91  4.39   7.09  4.84  4.03  3.83  3.93
Energy per access (non-burst) (nJ) 0.71 1.00  1.29  1.61  1.99   0.93  1.22  1.28  1.59  1.00
Energy per burst access (nJ)      0.15  0.45  0.76  1.08  1.46   0.13  0.43  0.74  1.06  1.42
Footprint (mm²)                   1.10  2.08  3.12  4.79  7.49   0.22  0.34  0.47  0.69  1.13
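To make the burst vs. non-burst trade-off in Table 3 concrete, the following minimal sketch compares the two access modes using the 4-layer, 8-sub-bank column (the values are copied from the table; the comparison itself is illustrative, not part of the chapter's methodology):

```python
# Burst vs. non-burst comparison, 4-layer / 8-sub-bank column of Table 3.
non_burst_energy_nj = 1.59   # energy per non-burst access (nJ)
burst_energy_nj = 1.06       # energy per burst access (nJ)
non_burst_time_ns = 6.44     # non-burst access time (ns)
burst_time_ns = 3.83         # burst access time (ns)

energy_saving = 1.0 - burst_energy_nj / non_burst_energy_nj
time_saving = 1.0 - burst_time_ns / non_burst_time_ns

print(f"burst access saves {energy_saving:.0%} energy "
      f"and {time_saving:.0%} latency per access")
```

For this design point, burst access cuts per-access energy by about a third and latency by about 40%, which is why the memory mapping tries to keep accesses on the same word-line.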
of 128 bits. We assume that the encoder must be able to support multi-frame motion
estimation with up to five reference frames. Hence, we need six 2-bank frame 3D
DRAM blocks to store the current frame and the five reference frames in the stacked
3D DRAM. This leads to an aggregate data I/O bandwidth of 128 × 6 = 768
bits, corresponding to 768 TSVs for the logic-DRAM data interconnect. We use the
inter-sub-array 3D partitioning strategy presented in Sect. 4 to estimate 3D DRAM
performance. For the target HDTV1080p resolution, each image frame needs about
2MByte storage. Table 3 shows the estimated 3D DRAM results for each 2 MByte
2-bank frame storage DRAM block at 65-nm node. Since each sub-bank always has
eight sub-arrays, we explore the 3D DRAM design space by varying the size of each
sub-array and the number of sub-banks. In this study, the number of bit-lines in each
sub-array is fixed at 512.
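The per-frame storage and TSV figures above can be sanity-checked with a short calculation (a sketch; it assumes the "about 2 MByte" per-frame estimate is dominated by an 8-bit luma plane, which the chapter does not state explicitly):

```python
# Per-frame storage for HDTV 1080p, assuming 8-bit samples:
# the 1920x1080 luma plane alone is close to the stated "about 2 MByte".
width, height = 1920, 1080
luma_bytes = width * height                       # 2,073,600 bytes
print(f"luma plane: {luma_bytes / 2**20:.2f} MiB")  # ~1.98 MiB

# Aggregate data I/O: six 2-bank frame blocks, 128 data bits each.
blocks, io_bits_per_block = 6, 128
total_tsvs = blocks * io_bits_per_block
print(f"data TSVs: {total_tsvs}")                 # 768
```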
Table 3 clearly shows a trade-off: as we increase the number of sub-banks by
reducing the size of each sub-array, we can directly reduce the access latency,
but the access energy consumption and DRAM footprint increase. We
considered both 1-layer and 4-layer 3D DRAM stacking, and the results clearly
show the advantages of 4-layer 3D DRAM stacking. As pointed out above,
the proposed 3D DRAM attempts to access data on the same word-line as much
as possible; such access is conventionally denoted burst access. Table 3 shows
the difference between burst and non-burst access, and burst access is clearly
preferable.
The image storage architecture presented above can seamlessly support any
arbitrary motion vector search pattern, and hence naturally supports various motion
estimation algorithms. In this case study we considered the following popular
algorithms: exhaustive full search (FS), three step search (TSS) [11], new three
step search (NTSS) [48], four step search (FSS) [60, 73], and diamond search
(DS) [70, 71]. We apply these algorithms to two widely used HDTV 1080p video
sequences, Tractor and Rush hour [72], where 15 frames are extracted and analyzed
in each video sequence. Figure 12 shows the peak signal-to-noise ratio (PSNR)
vs. average memory energy consumption for processing each image frame without
using an on-chip SRAM buffer. Each curve contains five points, corresponding to the
scenarios using 1, 2, 3, 4, and 5 reference frames, respectively. We note that, due to
the very regular memory access pattern of full search, explicit memory accesses can
be greatly reduced by data reuse in the motion estimation engine. In this study, we
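Of the search algorithms listed above, three step search (TSS) gives a feel for the data-dependent access patterns the frame store must serve: each step's candidate locations depend on the previous step's best match. The sketch below is a generic TSS skeleton, not the chapter's implementation; the `cost` callable stands in for a block-matching metric such as SAD against the current macroblock:

```python
def three_step_search(cost, center, step=4):
    """Classic three step search: evaluate the centre and its eight
    neighbours at the current step size, re-centre on the best match,
    then halve the step until it reaches 1.  `cost` is any block-matching
    metric (placeholder here); step=4 covers a +/-7 search window."""
    cx, cy = center
    while step >= 1:
        candidates = [(cx + dx * step, cy + dy * step)
                      for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        cx, cy = min(candidates, key=cost)
        step //= 2
    return cx, cy

# Toy quadratic cost surface with its minimum at (3, -2).
best = three_step_search(lambda p: (p[0] - 3) ** 2 + (p[1] + 2) ** 2, (0, 0))
print(best)  # (3, -2)
```

Unlike full search, the nine candidate blocks per step land at data-dependent addresses, which is exactly the irregular access pattern the 3D DRAM frame store is designed to absorb without an SRAM buffer.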