Predication
Modulo scheduling techniques for CGRAs [20, 22, 25, 48, 54, 55] only schedule loops that are free of control flow transfers. Hence any loop body that contains conditional statements first needs to be if-converted into hyperblocks by means of predication [45]. For this reason, many CGRAs, including ADRES CGRAs, support predication.
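As a minimal illustration of if-conversion, the following C sketch contrasts a loop body containing a conditional with its predicated, branch-free equivalent. The function and array names are hypothetical and merely stand in for a real loop body.

    #include <stddef.h>

    /* Original loop: the if/else is a control flow transfer inside the body. */
    void scale_branchy(const int *in, int *out, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (in[i] > 0)
                out[i] = in[i] * 2;
            else
                out[i] = -in[i];
        }
    }

    /* If-converted form: both paths are computed and a predicate selects the
     * result, so the body becomes a single branch-free hyperblock that a
     * modulo scheduler can handle. */
    void scale_predicated(const int *in, int *out, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            int p = in[i] > 0;      /* predicate                  */
            int t =  in[i] * 2;     /* "then" path, guarded by p  */
            int e = -in[i];         /* "else" path, guarded by !p */
            out[i] = p ? t : e;     /* predicated select          */
        }
    }

On a CGRA with predicated execution, the select and the guarded operations map to hardware predicates rather than branches, which is what allows the whole body to be treated as one hyperblock.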
Hyperblock formation can result in very inefficient code if a loop body contains code paths that are executed rarely. All those paths contribute to ResMII and potentially to RecMII. Hence even paths that are executed very infrequently can slow down a whole modulo-scheduled loop. Such loops can be detected with profiling, and if the data dependencies allow it, it can be useful to split them into multiple loops. For example, a first loop can contain only the code of the frequently executed paths, with a lower II than the original loop. If, during the execution of this first loop, some iteration turns out to require the infrequently executed code, the first loop is exited, and the remaining iterations are handled by a second loop that includes both the frequently and the infrequently executed code paths.
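The following C sketch illustrates such a loop split under simple assumptions; the helpers needs_rare_path, frequent_work, and rare_work are hypothetical placeholders for the check and the two code paths.

    #include <stddef.h>

    /* Hypothetical helpers standing in for the frequent and the rare work. */
    static int needs_rare_path(int x) { return x < 0; }   /* rarely true     */
    static int frequent_work(int x)   { return 3 * x; }   /* common path     */
    static int rare_work(int x)       { return -x;    }   /* infrequent path */

    /* Split loop: the first loop contains only the frequently executed path,
     * so it can be modulo-scheduled with a lower II.  As soon as an iteration
     * needs the rare path, control transfers to a second loop that contains
     * both paths (if-converted) for the remaining iterations. */
    void process_split(const int *in, int *out, size_t n)
    {
        size_t i = 0;

        /* Fast loop: frequent path only. */
        for (; i < n; i++) {
            if (needs_rare_path(in[i]))
                break;                       /* leave for the slow loop */
            out[i] = frequent_work(in[i]);
        }

        /* Slow loop: frequent and infrequent paths for the rest. */
        for (; i < n; i++) {
            int p  = needs_rare_path(in[i]);
            out[i] = p ? rare_work(in[i]) : frequent_work(in[i]);
        }
    }

Note that the fast loop still contains the check itself; what it no longer contains are the operations of the rare path, so those no longer contribute to its ResMII or RecMII.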
Alternatively, for some loops it is beneficial to have a so-called inspector loop with a very small II that performs only the checks for all iterations. If none of the checks are positive, a second, so-called executor loop is executed that includes all the computations except the checks and the infrequently executed paths. If some checks were positive, the original loop is executed instead.
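A sketch of the inspector/executor variant, using the same hypothetical helpers as above, could look as follows.

    #include <stddef.h>

    /* Hypothetical helpers, as in the previous sketch. */
    static int needs_rare_path(int x) { return x < 0; }
    static int frequent_work(int x)   { return 3 * x; }
    static int rare_work(int x)       { return -x;    }

    /* Inspector/executor scheme: a cheap inspector loop with a very small II
     * evaluates only the checks for all iterations.  If no check fires, an
     * executor loop runs the computation without checks or rare paths;
     * otherwise the original loop (checks plus both paths) is executed. */
    void process_inspector_executor(const int *in, int *out, size_t n)
    {
        int any_check_positive = 0;

        /* Inspector: checks only. */
        for (size_t i = 0; i < n; i++)
            any_check_positive |= needs_rare_path(in[i]);

        if (!any_check_positive) {
            /* Executor: no checks, no rare path. */
            for (size_t i = 0; i < n; i++)
                out[i] = frequent_work(in[i]);
        } else {
            /* Original loop: checks and both code paths. */
            for (size_t i = 0; i < n; i++)
                out[i] = needs_rare_path(in[i]) ? rare_work(in[i])
                                                : frequent_work(in[i]);
        }
    }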
One caveat with this loop splitting is that it expands the code size in the CGRA instruction memories. For power consumption reasons, these memories are kept as small as possible, so the local improvements obtained with the loop splitting need to be balanced against the total code size of all loops that have to share these memories.
Kernel-Only Loops
Predication can also be used to generate so-called kernel-only loop code. This is loop code that has no separate prologue and epilogue code fragments. Instead, the prologue and epilogue are included in the kernel itself, where predication is now used to guard whole software pipeline stages and to ensure that only the appropriate software pipeline stages are activated at each point in time. A traditional loop with a separate prologue and epilogue is compared to a kernel-only loop in Fig. 8, and a simplified sketch of such stage predication is shown below. Three observations need to be made here.
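As a purely conceptual illustration, the following C sketch models a kernel-only loop for an assumed three-stage software pipeline; the stage bodies, buffers, and predicates are illustrative assumptions and do not correspond to actual ADRES code.

    #include <stddef.h>

    #define STAGES 3   /* hypothetical pipeline depth */

    /* Kernel-only loop for an assumed three-stage software pipeline
     * (S0 = load, S1 = compute, S2 = store).  Every iteration of the single
     * kernel executes all three stages, but each stage is guarded by a
     * predicate that is only true while that stage should be live, so no
     * separate prologue or epilogue code is needed. */
    void kernel_only(const int *in, int *out, size_t n)
    {
        int s0 = 0, s1 = 0;                        /* inter-stage buffers     */

        for (size_t t = 0; t < n + STAGES - 1; t++) {
            int p0 = (t < n);                      /* S0 live while filling   */
            int p1 = (t >= 1) && (t < n + 1);      /* S1 live in the middle   */
            int p2 = (t >= 2);                     /* S2 live while draining  */

            /* Stages appear in reverse order so each reads last cycle's data. */
            if (p2) out[t - 2] = s1 + 1;           /* S2: store               */
            if (p1) s1 = s0 * 2;                   /* S1: compute             */
            if (p0) s0 = in[t];                    /* S0: load                */
        }
    }

In kernel-only code on a predicated CGRA, the guards p0, p1, and p2 become stage predicates rather than branches, so the pipeline ramps up and drains inside the single kernel loop.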
The first observation is that kernel-only code is usually faster because the pipeline stages of the prologue and epilogue are now executed on the CGRA accelerator, which can typically do so at much higher IPCs than the main core. This is a major difference between (ADRES) CGRAs and VLIWs: on the latter, kernel-only loops are much less useful because all code runs on the same number of ISs anyway.
The second observation is that, while kernel-only code will be faster on CGRAs, more time is spent in CGRA mode, as can be seen in Fig. 8. During the epilogue and prologue, the