Table 2 Configuration parameters used in Trimaran [10]

Parameter                                  Value
Frequency                                  200 MHz
Number of DRAM dies                        6
Die size                                   58.5 mm²
Parameters of one cluster                  ALU number: 4; register file: 2 kB multi-ported; area: 2.77 mm²; technology: 90 nm
2D SRAM instruction cache (per cluster)    4 kB, 2-way, 16-byte blocks; access latency: 1.26 ns
2D SRAM data cache (per cluster)           4 kB, 2-way, 16-byte blocks; access latency: 1.26 ns
Shared 2D SRAM L2 cache                    256 kB, 2-way, 32-byte blocks; access latency: 2.43 ns; area: 5.69 mm²; technology: 90 nm
Shared 3D DRAM L2 cache                    512 kB, 2-way, 32-byte blocks; access latency: 4.12 ns
Shared main memory                         1 GB, 4 kB page size; access latency: 22.5 ns; technology: 65 nm
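To put the Table 2 latencies in perspective, they can be expressed in processor cycles at the 200 MHz clock (5 ns per cycle). The minimal sketch below rounds each latency up to whole cycles; the rounding rule and the helper name `latency_cycles` are illustrative assumptions, not necessarily how Trimaran or M5 model timing internally.

```python
import math

# Clock frequency from Table 2.
CLOCK_MHZ = 200.0
CYCLE_NS = 1000.0 / CLOCK_MHZ  # 5.0 ns per cycle at 200 MHz

# Access latencies (ns) copied from Table 2.
LATENCIES_NS = {
    "L1 instruction cache (2D SRAM)": 1.26,
    "L1 data cache (2D SRAM)": 1.26,
    "L2 cache (2D SRAM)": 2.43,
    "L2 cache (3D DRAM)": 4.12,
    "main memory": 22.5,
}


def latency_cycles(latency_ns: float, cycle_ns: float = CYCLE_NS) -> int:
    """Round a latency up to whole clock cycles (assumed rule; minimum 1 cycle)."""
    return max(1, math.ceil(latency_ns / cycle_ns))


if __name__ == "__main__":
    for name, ns in LATENCIES_NS.items():
        print(f"{name:32s} {ns:6.2f} ns -> {latency_cycles(ns)} cycle(s)")
```

Under this rounding rule, every cache level in Table 2 completes within a single 5 ns cycle, while a main-memory access takes 4.5 cycles, i.e. 5 cycles when rounded up.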
5.3 Performance Evaluation
To evaluate the two 3D VLIW digital signal processor design options presented above
(Fig. 3a, b), we use the Trimaran simulator [10] to carry out simulations over a wide
range of signal processing benchmarks. Trimaran is an integrated compilation and
performance monitoring infrastructure that supports the HPL PlayDoh architecture [40].
We further integrate the memory subsystem of M5 [6] to strengthen Trimaran's memory
system simulation capability. For the 3D VLIW processor and DRAM integration, we
assume that the VLIW processor and the 3D DRAM are designed in 90- and 65-nm
technologies, respectively. For the VLIW processor, each cluster contains 4 ALUs, a
2 kB multi-ported register file, a private 4 kB instruction cache, and a private 4 kB
data cache. All the clusters share one 256 kB L2 cache.
We estimate the cluster silicon area based on [42], which reports a silicon implementation
of a VLIW digital signal processor at the 0.13-μm technology node whose architecture is
similar to the one considered in this work. In [42], one cluster, consisting of 6 ALUs and
a 16 kB single-ported register file, occupies about 3.2 mm², and the entire VLIW digital
signal processor contains 16 clusters and occupies 155 mm² (besides the 16 clusters, it
also contains two MIPS CPUs and a few other peripheral circuits). Accordingly, we estimate
that one cluster with 4 ALUs and a 2 kB multi-ported register file occupies about
2.77 mm² at the 90-nm technology node, the 256 kB L2 cache occupies 5.69 mm², and the
entire VLIW processor die is 58.5 mm². Therefore, when we move the 256 kB L2 cache into
the 3D DRAM domain, we can reallocate the freed silicon area to two additional clusters.
With a 58.5 mm² die size, six stacked DRAM dies at the 65-nm node can achieve a total
storage capacity of 1 GB with a 128-bit output width. Table 2 lists all the basic
configuration parameters used in the simulations.
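The area and capacity figures in the paragraph above can be checked with a little arithmetic. The back-of-envelope sketch below uses only the numbers quoted in the text; the helper names are hypothetical, and treating 1 GB as 2^30 bytes (8 Gb = 8192 Mb) is an assumption. The resulting density is an average over the whole die footprint, including periphery, not a cell-array density.

```python
import math

# Area figures (mm^2, 90-nm node) quoted in the text.
CLUSTER_AREA_MM2 = 2.77   # one cluster: 4 ALUs + 2 kB multi-ported register file
L2_SRAM_AREA_MM2 = 5.69   # 256 kB shared 2D SRAM L2 cache
DIE_AREA_MM2 = 58.5       # VLIW processor die size, also the DRAM die footprint

# 3D DRAM stack parameters quoted in the text.
DRAM_DIES = 6
DRAM_CAPACITY_GBIT = 8.0  # 1 GB = 8 Gb (assumes binary units: 1 GB = 2**30 bytes)


def extra_clusters(freed_area_mm2: float, cluster_area_mm2: float) -> int:
    """Whole clusters that fit into the area freed by moving the L2 cache off the logic die."""
    return math.floor(freed_area_mm2 / cluster_area_mm2)


def mbit_per_die(capacity_gbit: float, dies: int) -> float:
    """Average capacity of one DRAM die in Mb (binary: 1 Gb = 1024 Mb)."""
    return capacity_gbit * 1024 / dies


if __name__ == "__main__":
    n = extra_clusters(L2_SRAM_AREA_MM2, CLUSTER_AREA_MM2)
    print(f"extra clusters: {n}, using {n * CLUSTER_AREA_MM2:.2f} of {L2_SRAM_AREA_MM2} mm^2")
    per_die = mbit_per_die(DRAM_CAPACITY_GBIT, DRAM_DIES)
    print(f"per-die capacity: {per_die:.1f} Mb, "
          f"average density: {per_die / DIE_AREA_MM2:.1f} Mb/mm^2")
```

Two extra clusters need 5.54 mm² of the 5.69 mm² freed by the L2 cache, which is consistent with the two additional clusters stated in the text; each DRAM die then holds roughly 1365 Mb, or about 23 Mb/mm² averaged over the 58.5 mm² footprint.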
We use a conventional system architecture with off-chip DRAM as the baseline
configuration, in which the number of clusters is 2 and the L2 cache size is 256 kB.
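For comparison against the baseline, a configuration can be summarized by a small record. The sketch below encodes the baseline as described in the text and, as a purely illustrative assumption, one variant in which the L2 cache moves into the 3D DRAM stack (512 kB, per Table 2) and the freed area holds two additional clusters. The class `SystemConfig` and the variant are hypothetical; this excerpt does not state how the variant maps onto Fig. 3a or 3b.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SystemConfig:
    """Hypothetical summary of one simulated system configuration."""
    name: str
    clusters: int
    l2_kb: int
    l2_location: str              # where the shared L2 lives: "2D SRAM" or "3D DRAM"
    main_memory: str              # "off-chip DRAM" or "3D DRAM"
    alus_per_cluster: int = 4     # per Table 2
    l1i_kb_per_cluster: int = 4   # private 4 kB instruction cache
    l1d_kb_per_cluster: int = 4   # private 4 kB data cache

    def total_alus(self) -> int:
        return self.clusters * self.alus_per_cluster

    def total_l1_kb(self) -> int:
        return self.clusters * (self.l1i_kb_per_cluster + self.l1d_kb_per_cluster)


# Baseline described in the text: off-chip DRAM, 2 clusters, 256 kB SRAM L2.
BASELINE = SystemConfig(
    name="baseline", clusters=2, l2_kb=256,
    l2_location="2D SRAM", main_memory="off-chip DRAM",
)

# Illustrative variant (assumption): L2 moved into the 3D DRAM stack and the
# freed logic-die area spent on two additional clusters.
L2_IN_3D_DRAM = replace(
    BASELINE, name="L2 in 3D DRAM (hypothetical)", clusters=BASELINE.clusters + 2,
    l2_kb=512, l2_location="3D DRAM", main_memory="3D DRAM",
)

if __name__ == "__main__":
    for cfg in (BASELINE, L2_IN_3D_DRAM):
        print(f"{cfg.name}: {cfg.clusters} clusters, {cfg.total_alus()} ALUs, "
              f"{cfg.total_l1_kb()} kB L1, {cfg.l2_kb} kB L2 in {cfg.l2_location}, "
              f"main memory: {cfg.main_memory}")
```

Using `dataclasses.replace` on a frozen record keeps the baseline immutable, so each variant is expressed only as its differences from the baseline.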