Coarse-Grained Reconfigurable Array Architectures - Signal Processing Systems


4.2 ADRES Design Space Exploration

In this part of our case study, we discuss the importance of, and the opportunities for, DSE within the ADRES template. First, we discuss some concrete ADRES instances that have been used for extensive experimentation, including the fabrication of working silicon samples. These examples demonstrate that very power-efficient CGRAs can be designed for specific application domains.

Afterwards, we will show some examples of DSE results with respect to some of the specific design options that were discussed in Sect. 3.

4.2.1 Example ADRES Instances

During the development of the ADRES tool chain and design, two main ADRES instances have been worked out. One was designed for multimedia applications [5, 46] and one for SDR baseband processing [9, 10]. Their main differences are presented in Table 1. Both architectures have a 64-entry data RF (half rotating, half non-rotating) that is shared with a unified three-issue VLIW processor that executes non-loop code. Thus, this shared RF has six read ports and three write ports. Both CGRAs feature 16 FUs, of which four can access the memory (which consists of four single-ported banks) through a queue mechanism that can resolve bank conflicts. Most operations have a latency of one cycle, with the exception of loads, stores, and multiplications. One important difference between the two CGRAs relates to their pipeline schemes, as depicted for a single IS (local RF and FU) in Table 1. As the local RFs are only buffered at their input, pipelining registers need to be inserted in the paths to and from the FUs in order to reach the frequency targets indicated in the table. The pipeline latches shown in Table 1 hence contribute directly to maximizing the factor f_p in Eq. (1). Because the instruction sets and the target frequencies differ between the two application domains, the SDR CGRA has one more pipeline register than the multimedia CGRA, and the registers are located at different places in the design.
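The port counts and the bank-conflict behavior described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: two operand reads and one result write per VLIW issue slot, and word-interleaved banks selected by `address % num_banks`. Neither assumption is taken from the ADRES documentation, and the function names are hypothetical.

```python
def shared_rf_ports(issue_width, reads_per_op=2, writes_per_op=1):
    """Return (read_ports, write_ports) needed so that every issue slot
    can read its operands and write its result in the same cycle.

    Assumption: each operation reads two operands and writes one result.
    """
    return issue_width * reads_per_op, issue_width * writes_per_op


def cycles_to_serve(addresses, num_banks=4):
    """Return the cycles needed to serve a group of simultaneous accesses
    when each bank is single-ported.

    Assumption: banks are word-interleaved (bank = address % num_banks).
    Accesses that hit the same bank are serialized, so the busiest bank
    determines the total number of cycles.
    """
    per_bank = [0] * num_banks
    for address in addresses:
        per_bank[address % num_banks] += 1
    return max(per_bank)


if __name__ == "__main__":
    print(shared_rf_ports(3))            # ports for a three-issue VLIW
    print(cycles_to_serve([0, 1, 2, 3])) # four accesses, four distinct banks
    print(cycles_to_serve([0, 4, 8, 1])) # three accesses collide on bank 0
```

With three issue slots the sketch yields six read ports and three write ports, matching the figures in the text. Four accesses that fall in distinct banks complete in one cycle, while accesses that collide on a bank are serialized, which is the situation the queue mechanism has to absorb.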

Traditionally, in VLIWs or in out-of-order superscalar processors, deeper pipelining results in higher frequencies but also in lower IPCs because of larger branch misprediction penalties. Following Eq. (1), this can result in lower performance. In CGRAs, however, this is not necessarily the case, as explained in Sect. 3.3.1. To illustrate this, Table 3 includes IPCs obtained when generating code for both CGRAs with and without the pipelining latches.
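The trade-off between IPC and clock frequency can be made concrete with a short Python sketch. Eq. (1) is not reproduced in this section, so the sketch assumes only that performance scales with the product of IPC and clock frequency. The function names and all numbers are hypothetical and are not taken from Table 3.

```python
def relative_performance(ipc, freq_mhz):
    """Millions of operations per second, assuming performance is
    proportional to IPC times clock frequency (an assumption about
    the form of Eq. (1))."""
    return ipc * freq_mhz


def pipelining_speedup(ipc_before, freq_before, ipc_after, freq_after):
    """Ratio of performance after adding pipeline stages to performance
    before. A value above 1.0 means the deeper pipeline pays off."""
    before = relative_performance(ipc_before, freq_before)
    after = relative_performance(ipc_after, freq_after)
    return after / before


if __name__ == "__main__":
    # Hypothetical branch-heavy code: the IPC loss outweighs the clock gain.
    print(pipelining_speedup(2.0, 300, 1.5, 380))
    # Hypothetical loop kernel whose IPC barely changes with deeper pipelining.
    print(pipelining_speedup(10.0, 300, 9.8, 400))
```

In the first hypothetical case the ratio falls below 1.0, which corresponds to the traditional situation where a larger misprediction penalty cancels the frequency gain. In the second case the IPC is nearly unchanged, so the higher frequency translates almost fully into performance.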

The benchmarks mapped onto the multimedia ADRES CGRA are an H.264/AVC video decoder, a wavelet-based video decoder, an MPEG-4 video coder, a black-and-white TIFF image filter, and a SHA-2 cryptographic hash algorithm. For each application, at most the ten hottest inner loops are included in the table. For the SDR ADRES CGRA, we selected two baseband modem benchmarks: one that performs WLAN MIMO channel estimation and one that implements the remainder of a WLAN SISO receiver. All
