Coarse-Grained Reconfigurable Array Architectures - Signal Processing Systems


one resource-bound loop, which will result in a lower overall execution time.

Furthermore, less switching between operating modes takes place with fused

loops, and hence the terms $t_{p \to p+1}$ are minimized. In addition, fewer prologues

and epilogues need to be executed, which might also improve performance. This

improvement will usually be limited, however, because the fused prologues and

epilogues will rarely be much shorter than the sum of the original ones. Moreover,

loop fusion does result in a loop that is bigger than any of the original loops, so it

can only be applied if the configuration memory is big enough to fit the fused loop.

If this is the case, fewer loop configurations need to be stored and possibly reloaded

into the memory.
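
The effect of loop fusion described above can be illustrated with a minimal sketch. The function names and the example computation are hypothetical and not taken from the text; the sketch only shows that two loops over the same index range become one larger loop body that is entered once.

```python
def run_separate(a, b):
    """Two back-to-back loops over the same index range."""
    n = len(a)
    scaled = [0] * n
    for i in range(n):          # loop 1: needs its own configuration
        scaled[i] = 2 * a[i]
    out = [0] * n
    for i in range(n):          # loop 2: a second configuration and mode switch
        out[i] = scaled[i] + b[i]
    return out


def run_fused(a, b):
    """The same computation after fusing both bodies into one loop."""
    n = len(a)
    out = [0] * n
    for i in range(n):          # one larger body, entered only once
        scaled_i = 2 * a[i]     # the intermediate array shrinks to a scalar
        out[i] = scaled_i + b[i]
    return out
```

Both functions return the same result; the fused version has a single, larger loop body, which is why it only pays off if that body still fits in the configuration memory.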

Interchanging an inner and outer loop serves largely the same purpose as loop

fusion. As loop interchange does not necessarily result in larger prologues and

epilogues, it can be even more useful, as can be the combining of nested loops

into a single loop. Data-context switching [8] is a very similar technique that

serves the same purpose. That technique has been used by Lee et al. for statically

reconfigurable CGRAs as well [42], and in fact most of the loop transformations

mentioned in this section can be used to target such CGRAs, as well as any other

type of CGRA.
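
A rough sketch of loop interchange and of combining nested loops into a single loop is given here. It assumes, purely for illustration, that only the innermost loop is mapped onto the array, so the number of times the inner loop is entered stands in for the number of mode switches. All function names and the row-sum computation are hypothetical.

```python
def row_sums_original(m, rows, cols):
    """Outer loop over rows: the inner loop is entered `rows` times."""
    acc = [0] * rows
    inner_entries = 0
    for r in range(rows):
        inner_entries += 1              # inner loop (re)entered here
        for c in range(cols):
            acc[r] += m[r * cols + c]
    return acc, inner_entries


def row_sums_interchanged(m, rows, cols):
    """Loops interchanged: the inner loop is entered `cols` times."""
    acc = [0] * rows
    inner_entries = 0
    for c in range(cols):
        inner_entries += 1
        for r in range(rows):
            acc[r] += m[r * cols + c]
    return acc, inner_entries


def row_sums_combined(m, rows, cols):
    """Both loops combined into a single loop that is entered once."""
    acc = [0] * rows
    for k in range(rows * cols):
        acc[k // cols] += m[k]          # recover the row index from k
    return acc, 1
```

For a flat 4 x 3 matrix, the three variants compute the same sums but enter the mapped loop 4, 3, and 1 times respectively. Interchange helps here only because there are fewer columns than rows.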

Live-in Variables

In our experience, there is only one caveat with the above transformations. The

reason to be careful when applying them is that they can increase the number of

live-in variables. A live-in variable is a variable that is assigned a value before the

loop and is subsequently used in the loop. Live-in variables can be manifest in the

original source code, but they can also result from compiler optimizations that are

enabled by the above loop transformations, such as induction variable optimizations

and loop-invariant code motion. When the number of live-in variables increases,

more data needs to be passed from the non-loop code to the loop code, which might

have a negative effect on $t_{p \to p+1}$. The existence and the scale of this effect will

usually depend on the hardware mechanism that couples the CGRA accelerator

to the main core. Such mechanisms are discussed in Sect. 3.1. In tightly-coupled

designs like those of ADRES or Silicon Hive, passing a limited number of

values from the main CPU mode to the CGRA mode does not involve any overhead:

the values are already present in the shared RF. However, if their number grows too

large, there will not be enough room in the shared RF, which will result in much less

efficient passing of data through memory. We have experienced this several times

with loops in multimedia and SDR applications that were mapped onto our ADRES

designs. So, even for tightly-coupled CGRA designs, the above loop transformations

and the enabled optimizations need to be applied with great care.
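
As a small illustration of how loop-invariant code motion can raise the number of live-in variables, consider the sketch below. The polynomial kernel, the function names, and especially the cost model in `live_in_transfer_cost` are assumptions made for this example: the model simply charges nothing while the live-ins fit in the shared RF and a fixed, hypothetical cost for each value that has to go through memory.

```python
def poly_original(x, k):
    """Invariant products are recomputed in every iteration.

    Scalar live-in of the loop: k (1 value).
    """
    out = []
    for v in x:
        out.append(v * k + v * v * (k * k) + k * k * k)
    return out


def poly_hoisted(x, k):
    """Same kernel after loop-invariant code motion.

    Scalar live-ins of the loop: k, k2, k3 (3 values).
    """
    k2 = k * k                  # hoisted out of the loop
    k3 = k2 * k                 # hoisted out of the loop
    out = []
    for v in x:
        out.append(v * k + v * v * k2 + k3)
    return out


def live_in_transfer_cost(num_live_ins, rf_slots, cost_per_spill):
    """Hypothetical model: live-ins that fit in the shared RF are free,
    each remaining one costs `cost_per_spill` to pass through memory."""
    spilled = max(0, num_live_ins - rf_slots)
    return spilled * cost_per_spill
```

The hoisted loop does less work per iteration but needs three scalar live-ins instead of one. Under the assumed model, `live_in_transfer_cost(3, 4, 10)` is 0, while `live_in_transfer_cost(6, 4, 10)` is 20, which mirrors the cliff described above once the shared RF runs out of room.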
