Digital Signal Processing Reference
and operation scheduling. Similar techniques for wide loads and stores have also
been proposed for regular VLIW architectures to reduce power [60]. Exploiting
that hardware likewise requires manual data layout optimizations.
Both Silicon Hive and PACT feature distributed memory blocks without a
crossbar. A Silicon Hive programmer has to specify the allocation of data to
the memories so that the compiler can bind the appropriate load/store operations
to the corresponding memories. Silicon Hive also supports interfacing the memory
or system bus through FIFO interfaces. This is efficient for streaming
processing, but it is difficult to use when the data needs to be buffered, as in
the case of data reuse.
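To see why explicit data allocation matters on such architectures, consider a minimal sketch of distributed banks without a crossbar. The topology below (each load/store issue slot hard-wired to exactly one bank) and all names are illustrative assumptions, not Silicon Hive's actual interface:

```python
# Hypothetical model: without a crossbar, each load/store issue slot (IS)
# reaches only the one bank it is wired to, so every array must be
# allocated to the bank of the IS that accesses it.

# Assumed, illustrative topology: IS i is connected only to bank i.
IS_TO_BANK = {0: 0, 1: 1, 2: 2, 3: 3}

def check_allocation(allocation, accesses):
    """allocation: array name -> bank; accesses: (is_id, array) pairs.
    Returns the accesses that are infeasible without a crossbar."""
    return [(is_id, arr) for is_id, arr in accesses
            if allocation[arr] != IS_TO_BANK[is_id]]

alloc = {"coeffs": 0, "samples": 1}
print(check_allocation(alloc, [(0, "coeffs"), (1, "samples")]))  # feasible: []
print(check_allocation(alloc, [(2, "coeffs")]))  # infeasible access reported
```

In this model, moving an access from one IS to another (as an instruction scheduler might) forces a matching change in the data layout, which is exactly the coupling the programmer has to manage by hand.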
The ADRES architecture template provides a parameterizable Data Memory
Queue (DMQ) interface to each of the different single-ported, interleaved level-1
scratch-pad memory banks [19]. The DMQ interface is responsible for resolving
bank access conflicts, i.e., cases in which multiple load/store ISs try to access the
same bank in the same cycle. Connecting all load/store ISs to all banks through a
conflict resolution mechanism allows maximal freedom for data access patterns and
also maximal freedom on the data layout in memory. The potential disadvantage of
such conflict resolution is that it increases the latency of load operations. In software
pipelined code, however, increasing the individual latency of instructions most often
does not have a negative effect on the schedule quality, because the compiler can
hide those latencies in the software pipeline. In the main processor VLIW mode of
an ADRES, the same memories are accessed in code that is not software-pipelined.
So in that mode, the conflict resolution is disabled to obtain shorter access latencies.
Alternatively, a data cache can be added to the memory hierarchy to complement
the scratch-pad memories. By letting the compiler partition the data over the scratch-
pad memories and the data cache in an appropriate manner, high throughput can be
obtained in the CGRA mode, as well as low latency in the VLIW mode [32].
3.6 Compiler Support
Apart from the specific algorithms used to compile code, the major distinctions
between the different existing compiler techniques relate to whether or not they
support static scheduling, whether or not they support dynamic reconfiguration,
whether or not they rely on special programming languages, and whether or not
they are limited to specific hardware properties. Because most compiler research has
been done to generate static schedules for CGRAs, we focus on those in this section.
As already indicated in Sects. 3.2.1 and 3.2.2, many algorithms are based on FPGA
placement and routing techniques [7] in combination with VLIW code generation
techniques like modulo scheduling [39, 61] and hyperblock formation [45].
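The starting point of modulo scheduling is the minimum initiation interval (MII), the smallest number of cycles between successive loop iterations, computed as the maximum of a resource-constrained and a recurrence-constrained bound. The sketch below uses the standard textbook formulation; the operation counts and dependence cycles are made-up example inputs:

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource-constrained bound: a resource class r with unit_counts[r]
    units can start at most that many of its op_counts[r] ops per cycle."""
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(cycles):
    """Recurrence-constrained bound: cycles is a list of
    (total_latency, loop_carried_distance) per dependence cycle."""
    return max(ceil(lat / dist) for lat, dist in cycles)

# Example loop: 4 memory ops on 2 load/store units, 6 ALU ops on 4 ALUs,
# and two dependence cycles (latency 3 over distance 1, latency 4 over 2).
mii = max(res_mii({"mem": 4, "alu": 6}, {"mem": 2, "alu": 4}),
          rec_mii([(3, 1), (4, 2)]))
print(mii)  # -> 3: the recurrence bound dominates here
```

The scheduler then tries to find a valid schedule at II = MII and increments the II on failure, which is why hiding individual operation latencies inside the pipeline, as discussed above, usually costs nothing in throughput.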
Whether or not compiler techniques rely on specific hardware properties is
not always obvious in the literature, as not enough details are available in the
descriptions of the techniques, and few techniques have been tried on a wide range