Digital Signal Processing Reference
4.2 ADRES Design Space Exploration
In this part of our case study, we discuss the importance of, and the opportunities
for, design space exploration (DSE) within the ADRES template. First, we discuss some
concrete ADRES instances that have been used for extensive experimentation, including
the fabrication of working silicon samples. These examples demonstrate that very
power-efficient CGRAs can be designed for specific application domains.
Afterwards, we show some examples of DSE results for some of the specific design
options that were discussed in Sect. 3.
4.2.1 Example ADRES Instances
During the development of the ADRES tool chain and design, two main ADRES
instances have been worked out. One was designed for multimedia applications
[5, 46] and one for SDR baseband processing [9, 10]. Their main differences are
presented in Table 1. Both architectures have a 64-entry data RF (half rotating,
half non-rotating) that is shared with a unified three-issue VLIW processor that
executes non-loop code. This shared RF therefore has six read ports and three write
ports. Both CGRAs feature 16 FUs, of which four can access the memory (which
consists of four single-ported banks) through a queue mechanism that can resolve
bank conflicts. Most operations have a latency of one cycle, with the exception of
loads, stores, and multiplications. One important difference between the two CGRAs
relates to their pipeline schemes, as depicted for a single IS (local RF and FU) in
Table 1. As the local RFs are only buffered at their input, pipelining registers need
to be inserted in the paths to and from the FUs in order to reach the frequency
targets indicated in the table. The pipeline latches shown in Table 1 hence directly
contribute to maximizing the factor f_p in Eq. (1). Because the instruction
sets and the target frequencies differ between the two application domains, the SDR
CGRA has one more pipeline register than the multimedia CGRA, and the registers are
located at different places in the design.
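The port count and the bank-conflict behaviour described above can be checked with a little arithmetic. The following Python sketch is illustrative only: it assumes that each VLIW issue slot reads two source operands and writes one result per cycle, and that word addresses are interleaved across the banks by address modulo the number of banks. Neither assumption is stated in the text, and the function names are hypothetical.

```python
from collections import Counter


def shared_rf_ports(issue_width, reads_per_op=2, writes_per_op=1):
    """Return (read_ports, write_ports) for a register file shared by
    `issue_width` issue slots. The two-reads/one-write default per slot
    is an assumption made for this sketch."""
    return issue_width * reads_per_op, issue_width * writes_per_op


def cycles_to_serve(addresses, num_banks=4):
    """Cycles needed to serve a set of simultaneous memory accesses when
    every bank is single-ported, i.e. serves one access per cycle.
    Bank selection by `address % num_banks` is an assumed interleaving."""
    if not addresses:
        return 0
    per_bank = Counter(addr % num_banks for addr in addresses)
    # The most heavily loaded bank determines how long the queue drains.
    return max(per_bank.values())


# Three-issue VLIW sharing the data RF.
read_ports, write_ports = shared_rf_ports(3)

# Four memory-capable FUs issuing at once: no conflict vs. full conflict.
no_conflict = cycles_to_serve([0, 1, 2, 3])
full_conflict = cycles_to_serve([0, 4, 8, 12])
```

Under these assumptions, a three-issue VLIW yields the six read ports and three write ports quoted in the text, and four accesses that all map to the same bank take four cycles to drain instead of one.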
Traditionally, in VLIWs or in out-of-order superscalar processors, deeper pipelining
results in higher frequencies but also in lower IPCs because of larger branch
misprediction penalties. Following Eq. (1), this can result in lower performance.
In CGRAs, however, this is not necessarily the case, as explained in Sect. 3.3.1.
To illustrate this, Table 3 includes IPCs obtained when generating code for both
CGRAs with and without the pipelining latches.
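To make this trade-off concrete, the sketch below assumes that Eq. (1) expresses performance as proportional to IPC times clock frequency, and it uses a simple textbook stall model for branch mispredictions. All numbers are invented for illustration and are not measurements from Table 3; the function names are hypothetical.

```python
def effective_ipc(base_ipc, branch_fraction, mispredict_rate, penalty_cycles):
    """IPC after adding misprediction stalls to the ideal cycles per
    instruction. This is a generic stall model, assumed for this sketch."""
    cpi = 1.0 / base_ipc + branch_fraction * mispredict_rate * penalty_cycles
    return 1.0 / cpi


def performance(ipc, freq_mhz):
    """Instructions per microsecond, assuming performance ~ IPC * f."""
    return ipc * freq_mhz


# Hypothetical branch-heavy code: deeper pipelining raises the clock
# from 300 to 400 MHz but also raises the penalty from 5 to 20 cycles.
shallow = performance(effective_ipc(2.0, 0.2, 0.1, 5), 300.0)
deep = performance(effective_ipc(2.0, 0.2, 0.1, 20), 400.0)

# Hypothetical code without mispredicted branches: IPC is unaffected,
# so the higher clock translates directly into higher performance.
deep_no_branches = performance(effective_ipc(2.0, 0.0, 0.0, 20), 400.0)
```

With these made-up parameters, the deeper pipeline loses on the branch-heavy code despite its higher clock, whereas code without mispredicted branches keeps its IPC and benefits fully from the frequency gain.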
The benchmarks mapped onto the multimedia ADRES CGRA are an H.264/AVC
video decoder, a wavelet-based video decoder, an MPEG-4 video coder, a
black-and-white TIFF image filter, and a SHA-2 hash algorithm. For each application,
at most the ten hottest inner loops are included in the table. For the SDR ADRES
CGRA, we selected two baseband modem benchmarks: one WLAN MIMO Channel
Estimation and one that implements the remainder of a WLAN SISO receiver. All
 