IC_p of each phase, and in terms of the time overhead involved in switching between
the phases t_{p→p+1} as follows:

    performance = 1 / execution time
                = 1 / Σ_{p∈P} ( IC_p / (IPC_p · f_p) + t_{p→p+1} ).    (1)
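As a sketch of Eq. (1), the following C function evaluates this performance model over a list of phases. The struct, its field names, and the sample values in the usage below are illustrative assumptions, not parameters of any real processor.

```c
/* Hypothetical per-phase parameters; field names follow Eq. (1). */
struct phase {
    double ic;    /* instruction count IC_p               */
    double ipc;   /* instructions per cycle IPC_p         */
    double f;     /* operating frequency f_p, in Hz       */
    double t_sw;  /* switching overhead t_{p->p+1}, in s  */
};

/* Accumulate IC_p / (IPC_p * f_p) + t_{p->p+1} over all phases
 * p in P, then invert the total execution time to obtain the
 * performance of Eq. (1). */
double performance(const struct phase *phases, int n)
{
    double exec_time = 0.0;
    for (int p = 0; p < n; p++)
        exec_time += phases[p].ic / (phases[p].ipc * phases[p].f)
                   + phases[p].t_sw;
    return 1.0 / exec_time;
}
```

For example, two phases of one million instructions each, executed at 2 IPC and 1 GHz with no switching overhead, yield an execution time of 1 ms and thus a performance of 1000 executions per second.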
The operating frequencies f_p cannot be increased indefinitely because of power-
efficiency reasons. Alternatively, a designer can increase the performance by
designing or selecting a system that can execute code at higher IPCs. In a power-
efficient architecture, a high IPC is reached for the most important phases l ∈ L ⊆ P,
with L typically consisting of the compute-intensive inner loops, while limiting
their instruction count IC_l and reaching a sufficiently high, but still power-efficient,
frequency f_l. Furthermore, the time overhead t_{p→p+1} as well as the corresponding
energy overhead of switching between the execution modes of consecutive phases
should be minimized if such switching happens frequently. Note that such switching
only happens on hardware that supports multiple execution modes in support of
phases with different characteristics.
Coarse-Grained Reconfigurable Array (CGRA) accelerators aim for these goals
for the inner loops found in many digital signal processing (DSP) domains, includ-
ing multimedia and Software-Defined Radio (SDR) applications. Such applications
have traditionally employed Very Long Instruction Word (VLIW) architectures such
as the TriMedia 3270 [74] and the TI C64 [70], Application-Specific Integrated
Circuits (ASICs), and Application-Specific Instruction Processors (ASIPs). To a
large degree, the reasons for running these applications on VLIW processors also
apply for CGRAs. First of all, a large fraction of the computation time is spent
in manifest nested loops that perform computations on arrays of data and that
can, possibly through compiler transformations, provide a lot of Instruction-Level
Parallelism (ILP). Secondly, most of those inner loops are relatively simple. When
the loops include conditional statements, these can be implemented by means of
predication [45] instead of with complex control flow. Furthermore, none or very
few loops contain multiple exits or continuation points in the form of, e.g., break
or continue statements as in the C language. Moreover, after inlining, the loops
are free of function calls. Finally, the loops are not regular or homogeneous enough
to benefit from vector computing, like on the EVP [6] or on Ardbeg [75]. When
there is enough regularity and Data-Level Parallelism (DLP) in the loops of an
application, vector computing can typically exploit it more efficiently than what
can be achieved by converting the DLP into ILP and exploiting that on a CGRA.
So, in short, CGRAs (with limited DLP support) are ideally suited for applications
whose time-consuming parts have manifest behavior, large amounts of ILP, and
limited amounts of DLP.
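The predication (if-conversion) mentioned above can be illustrated with a minimal C sketch. The function and data are hypothetical, and in practice a compiler applies this transformation to predicated hardware operations rather than at the source level.

```c
/* Branchy loop body:   if (a[i] > 0) s += a[i]; else s -= a[i];
 *
 * If-converted (predicated) form: both arms are computed and a
 * predicate selects between them, so the loop body contains no
 * control flow besides the loop branch itself. */
int abs_sum_predicated(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) {
        int p   = a[i] > 0;     /* predicate               */
        int add = s + a[i];     /* result of the then-arm  */
        int sub = s - a[i];     /* result of the else-arm  */
        s = p ? add : sub;      /* predicated select       */
    }
    return s;
}
```

Removing the conditional branch this way lets all iterations issue the same fixed sequence of operations, which is what allows a CGRA (or VLIW) to software-pipeline the loop at a high IPC.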
In the remainder of this chapter, Sect. 2 presents the fundamental properties
of CGRAs. Section 3 gives an overview of the design options for CGRAs. This
overview helps designers evaluate whether or not CGRAs are suited for their
applications and their design requirements, and if so, which CGRA designs are
most suited. After the overview, Sect. 4 presents a case study on the ADRES CGRA