Coarse-Grained Reconfigurable Array Architectures - Signal Processing Systems

Digital Signal Processing Reference

ALUs and multipliers with direct connections between them and their local RFs.

These direct connections within each IS can take care of a lot of data transfers,

thus freeing time on the shared bus-based interconnect that connects all ISs. In this way,

the local interconnect within each IS compensates for the lack of a scaling global

interconnect. One advantage of this clustering approach is that the compiler can be

tuned specifically for this combination of local and global connections and for the

fact that it does not need to support heterogeneous ISs. Whether or not this type of

design is more power-efficient than that of CGRAs with more design freedom and

potentially more heterogeneity is unclear at this point in time. At least, we know

of no studies from which, e.g., utilization numbers can be derived that allow us to

compare the two approaches.

Some architectures combine the flexibility of heterogeneous ADRES ISs with

clustering. For example, the CGRA Express [57] and the expression-grained

reconfigurable array (EGRA) [3] architectures feature heterogeneous clusters of

relatively simple, fast ALUs. Within the clusters, those ALUs are chained by means

of a limited number of latchless connections. Through careful design, the delay

of those chains is comparable to the delay of other, more complex ISs on the

CGRA that bound the clock frequency, so the chaining does not affect the clock

frequency. It does, however, allow multiple dependent operations to execute within

one clock cycle, and can therefore improve performance significantly. As the chains

and clusters are composed of existing components such as ISs, buses, multiplexers

and connections, these clustered designs do not really extend the design space

of non-clustered CGRAs like ADRES. Still, it can be useful to treat clusters as

a separate design level in between the IS component level and the whole array

architecture level, for example because it allows code generation algorithms in

compilers to be tuned for their existence [57].
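
To make the effect of chaining concrete, the following minimal sketch counts the cycles needed for a small dependence graph when up to a given number of dependent ALU operations fit in one clock cycle. It is an illustration only, not the scheduler of CGRA Express or EGRA: the function name `asap_cycles`, the parameter `max_chain`, and the example operations are hypothetical, and resource limits (number of ALUs, ports, buses) are ignored.

```python
from typing import Dict, List, Tuple


def asap_cycles(deps: Dict[str, List[str]], max_chain: int) -> int:
    """Cycles needed to execute all ops as soon as possible.

    deps maps each op to the ops whose results it consumes.
    max_chain is the assumed number of dependent ops that can be chained
    through latchless connections within one cycle (1 means no chaining).
    """
    # op -> (cycle, position of the op within that cycle's chain)
    slot: Dict[str, Tuple[int, int]] = {}

    def place(op: str) -> Tuple[int, int]:
        if op in slot:
            return slot[op]
        best = (0, 1)  # no predecessors: first cycle, head of a chain
        for pred in deps[op]:
            p_cycle, p_pos = place(pred)
            if p_pos < max_chain:
                # Room left in the chain: run in the same cycle as pred.
                candidate = (p_cycle, p_pos + 1)
            else:
                # Chain is full: wait for the next cycle.
                candidate = (p_cycle + 1, 1)
            best = max(best, candidate)
        slot[op] = best
        return best

    for op in deps:
        place(op)
    if not slot:
        return 0
    return 1 + max(cycle for cycle, _ in slot.values())


# Four dependent operations: ((x + y) << 1) & m, then - z.
chain = {"add": [], "shl": ["add"], "and": ["shl"], "sub": ["and"]}
print(asap_cycles(chain, max_chain=1))  # without chaining
print(asap_cycles(chain, max_chain=2))  # two ops chained per cycle
```

Under these assumptions the four dependent operations take four cycles without chaining and two cycles when two operations are chained per cycle, while the clock period stays the same.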

A specific type of clustering was proposed to handle floating-point arithmetic.

While most research on CGRAs is limited to integer and fixed-point arithmetic, Lee

et al. proposed to cluster two ISs to handle floating-point data [41]. In their design,

both ISs in the cluster can operate independently on integer or fixed-point data, but

they can also cooperate by means of a special direct interconnect between them.

When they cooperate, one IS in the cluster consumes and handles the mantissas,

while the other IS consumes and produces the exponents. As a single IS can thus be

used for both floating-point and integer computations, Lee et al. are able to achieve

high utilization for integer applications, floating-point applications, as well as mixed

applications.
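
The division of labor between the two ISs can be illustrated with a floating-point multiplication built from two integer operations. The sketch below is not the design of Lee et al.; it only shows why integer units suffice once mantissas and exponents are handled separately. The names `split`, `mantissa_unit_mul`, `exponent_unit_add`, and `fp_mul`, as well as the 24-bit mantissa width, are assumptions made for this example, and rounding and normalization are left to Python's `math.ldexp`.

```python
import math

MANT_BITS = 24  # assumed mantissa width for this sketch


def split(x: float):
    """Decompose x into integers (m, e) such that x is close to m * 2**e."""
    frac, exp = math.frexp(x)  # x == frac * 2**exp, with 0.5 <= |frac| < 1
    # Truncate the fraction to MANT_BITS bits and fold the scale into e.
    return int(frac * (1 << MANT_BITS)), exp - MANT_BITS


def mantissa_unit_mul(ma: int, mb: int) -> int:
    """Work of the mantissa IS: one integer multiplication."""
    return ma * mb


def exponent_unit_add(ea: int, eb: int) -> int:
    """Work of the exponent IS: one integer addition."""
    return ea + eb


def fp_mul(a: float, b: float) -> float:
    """Multiply two floats using only the two integer 'units' above."""
    ma, ea = split(a)
    mb, eb = split(b)
    m = mantissa_unit_mul(ma, mb)
    e = exponent_unit_add(ea, eb)
    return math.ldexp(m, e)  # recombine: m * 2**e


print(fp_mul(1.5, 2.25))
print(fp_mul(-0.5, 4.0))
```

In hardware, normalizing the result requires the mantissa unit to pass a shift amount to the exponent unit, which is one plausible use for a direct connection between the two ISs.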

With respect to utilization, it is clear that the designs of Fig. 7a, b will only be

utilized well if a lot of multiplications need to be performed. Otherwise, the area-

consuming multipliers remain unused. To work around this problem, the sharing

of large resources such as multipliers between ISs has been proposed in the RSPA

CGRA design [33]. Figure 7d depicts one row of ISs that do not contain multipliers

internally, but that are connected to a shared multiplier through switches and a

shared bus. The advantage of this design, compared to an ADRES design in which

each row features three pure ALU ISs and one ALU+MULT IS, is that this design

allows the compiler to schedule multiplications in all ISs (albeit only one per cycle),
