Digital Signal Processing Reference
one resource-bound loop, which will result in a lower overall execution time.
Furthermore, less switching between operating modes takes place with fused
loops, and hence the terms $t_{p \rightarrow p+1}$ are minimized. In addition, fewer prologues
and epilogues need to be executed, which might also improve performance. This
improvement will usually be limited, however, because the fused prologues and
epilogues will rarely be much shorter than the sum of the original ones. Moreover,
loop fusion does result in a loop that is bigger than any of the original loops, so it
can only be applied if the configuration memory is big enough to fit the fused loop.
If this is the case, fewer loop configurations need to be stored and possibly reloaded
into the memory.
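As an illustration of the fusion described above, the following is a minimal C sketch, not code from any CGRA toolchain. The function names, the arrays, and the `gain` and `bias` parameters are hypothetical, and the sketch assumes the arrays do not alias. The fusion is legal here because `out[i]` depends only on the `tmp[i]` produced in the same iteration.

```c
#include <stddef.h>

/* Two separate loops: each one would be mapped as its own loop
   configuration, with its own prologue, epilogue and mode switch. */
void scale_offset_separate(int *tmp, int *out, const int *in,
                           size_t n, int gain, int bias)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = in[i] * gain;
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[i] + bias;
}

/* Fused version: one loop with a larger body, entered only once.
   Assumes in, tmp and out do not overlap. */
void scale_offset_fused(int *tmp, int *out, const int *in,
                        size_t n, int gain, int bias)
{
    for (size_t i = 0; i < n; i++) {
        tmp[i] = in[i] * gain;
        out[i] = tmp[i] + bias;
    }
}
```

The fused body is larger than either original body, which is why the fused loop has to fit in the configuration memory before the transformation pays off.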
Interchanging an inner and outer loop serves largely the same purpose as loop
fusion. As loop interchange does not necessarily result in larger prologues and
epilogues, it can be even more useful, as can be the combining of nested loops
into a single loop. Data-context switching [8] is a very similar technique that
serves the same purpose. That technique has been used by Lee et al. for statically
reconfigurable CGRAs as well [42], and in fact most of the loop transformations
mentioned in this section can be used to target such CGRAs, as well as any other
type of CGRA.
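Loop interchange and the combining of nested loops can be sketched in C as follows. This is an illustrative example under stated assumptions: the names `FRAMES`, `TAPS`, and the three functions are hypothetical, and only the innermost loop is assumed to be mapped onto the array. In the original nest the short inner loop is entered `FRAMES` times, after interchange the longer inner loop is entered `TAPS` times, and after combining the nest a single loop is entered once.

```c
#include <stddef.h>

#define FRAMES 6   /* hypothetical outer trip count (large) */
#define TAPS   3   /* hypothetical inner trip count (small) */

/* Original nest: the inner loop runs TAPS iterations and is
   entered FRAMES times. */
void add_bias_nest(int m[FRAMES][TAPS], int bias)
{
    for (size_t i = 0; i < FRAMES; i++)
        for (size_t j = 0; j < TAPS; j++)
            m[i][j] += bias;
}

/* Interchanged: the inner loop runs FRAMES iterations and is
   entered only TAPS times. */
void add_bias_interchanged(int m[FRAMES][TAPS], int bias)
{
    for (size_t j = 0; j < TAPS; j++)
        for (size_t i = 0; i < FRAMES; i++)
            m[i][j] += bias;
}

/* Nest combined into a single loop of FRAMES * TAPS iterations,
   entered once; the 2-D indices are recovered from k. */
void add_bias_single_loop(int m[FRAMES][TAPS], int bias)
{
    for (size_t k = 0; k < (size_t)FRAMES * TAPS; k++)
        m[k / TAPS][k % TAPS] += bias;
}
```

All three versions compute the same result because the iterations are independent; with loop-carried dependences, interchange is not always legal.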
Live-in Variables
In our experience, there is only one caveat with the above transformations. The
reason to be careful when applying them is that they can increase the number of
live-in variables. A live-in variable is a variable that gets assigned a value before the
loop, which is subsequently used in the loop. Live-in variables can be manifest in the
original source code, but they can also result from compiler optimizations that are
enabled by the above loop transformations, such as induction variable optimizations
and loop-invariant code motion. When the number of live-in variables increases,
more data needs to be passed from the non-loop code to the loop code, which might
have a negative effect on $t_{p \rightarrow p+1}$. The existence and the scale of this effect will
usually depend on the hardware mechanism that couples the CGRA accelerator
to the main core. Possible mechanisms are discussed in Sect. 3.1. In
tightly-coupled designs like that of ADRES or Silicon Hive, passing a limited number of
values from the main CPU mode to the CGRA mode does not involve any overhead:
the values are already present in the shared RF. However, if their number grows too
big, there will not be enough room in the shared RF, which will result in much less
efficient passing of data through memory. We have experienced this several times
with loops in multimedia and SDR applications that were mapped onto our ADRES
designs. So, even for tightly-coupled CGRA designs, the above loop transformations
and the enabled optimizations need to be applied with great care.
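The way an optimization can raise the live-in count can be sketched with a small, hypothetical C kernel; the function names and the polynomial itself are assumptions chosen for illustration, and the hoisting is applied by hand to mimic loop-invariant code motion. Before hoisting, the loop body reads two scalar live-ins (`a` and `b`). After hoisting, it reads three (`ab`, `sum`, `diff`), each of which has to be passed from the non-loop code to the loop code.

```c
#include <stddef.h>

/* As written: the loop body uses two scalar live-ins, a and b
   (in addition to the pointers x, y and the trip count n). */
void poly_original(int *y, const int *x, size_t n, int a, int b)
{
    for (size_t i = 0; i < n; i++)
        y[i] = (a * b) * x[i] * x[i] + (a + b) * x[i] + (a - b);
}

/* After hand-applied loop-invariant code motion: less work per
   iteration, but the loop body now uses three scalar live-ins. */
void poly_hoisted(int *y, const int *x, size_t n, int a, int b)
{
    const int ab   = a * b;   /* computed once, before the loop */
    const int sum  = a + b;
    const int diff = a - b;

    for (size_t i = 0; i < n; i++)
        y[i] = ab * x[i] * x[i] + sum * x[i] + diff;
}
```

Whether the extra live-in costs anything depends on the coupling mechanism: with a shared register file it is free until the registers run out, after which values would have to be passed through memory.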