Digital Signal Processing Reference
architecture. This study serves two purposes. First, it illustrates the extent to which
source code needs to be tuned to map well onto CGRA architectures. As we will
show, this is an important aspect of using CGRAs, even when good compiler
support is available and when a very flexible CGRA is targeted, i.e., one that puts
very few restrictions on the loop bodies that it can accelerate. Second, our use
case illustrates how Design Space Exploration is necessary to instantiate optimized
designs from parameterizable and customizable architecture templates such as the
ADRES architecture template. Some conclusions are drawn in Sect. 5.
2 CGRA Basics
CGRAs focus on the efficient execution of the type of loops discussed in the
previous section. By neglecting non-loop code or outer-loop code that is assumed
to be executed on other cores, CGRAs can take the VLIW principles for exploiting
ILP in loops a step further to consume less energy and deliver higher performance,
without compromising on available compiler support. Figures 1 and 2 illustrate this.
Higher performance for high-ILP loops is obtained through two main features
that separate CGRA architectures from VLIW architectures. First, CGRA
architectures typically provide more Issue Slots (ISs) than typical VLIWs do.
In the CGRA literature, other commonly used terms for CGRA
ISs are Arithmetic-Logic Units (ALUs), Functional Units (FUs), or Processing
Elements (PEs). Conceptually, these terms all denote the same thing: logic on which
an instruction can be executed, typically one per cycle. For example, a typical
ADRES [9-11, 20, 46, 48-50] CGRA consists of 16 issue slots, whereas the TI
C64 features 8 slots, and the NXP TriMedia features only 5 slots. The higher
number of ISs directly enables higher IPCs, and hence higher performance,
as indicated by Eq. (1). To support these higher IPCs, the bandwidth to memory
is increased by providing more load/store ISs than on a typical VLIW, and by adding
special memory hierarchies as found on ASIPs, ASICs, and other DSPs, such as
FIFOs, stream buffers, and scratch-pad memories. Second, CGRA architectures
typically provide a number of direct connections between the ISs that allow data to
“flow” from one IS to another without needing to pass data through a Register File (RF).
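A back-of-the-envelope sketch (assumptions mine, not an equation from the chapter) shows why the issue-slot counts quoted above matter: if a loop body contains a fixed number of operations and scheduling were perfect, the achievable cycle count per iteration is bounded below by the operation count divided by the number of issue slots.

```c
/* Idealized minimum cycles to execute `ops` operations on a machine
 * with `slots` issue slots, assuming perfect scheduling (achieved
 * IPC == slots). Real IPC is limited by dependences and resource
 * conflicts, so this is only an upper bound on the benefit. */
long min_cycles(long ops, long slots) {
    return (ops + slots - 1) / slots;  /* ceiling division */
}
```

For a loop body of 64 operations, this idealized bound gives 4 cycles on a 16-slot ADRES instance, 8 on the 8-slot TI C64, and 13 on the 5-slot NXP TriMedia, which is the scaling the text alludes to.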
Fig. 1 An example clustered VLIW architecture with two RFs (RF 0, RF 1) and
eight ISs (IS 0 through IS 7). Solid directed edges denote physical connections.
Black and white small boxes denote input and output ports, respectively. There is
a one-to-one mapping between input and output ports and physical connections