Digital Signal Processing Reference
whereas this scheduling freedom would be limited to one IS slot in the ADRES
design. To allow this schedule freedom, however, a significant amount of resources
in the form of switches and a special-purpose bus need to be added to the row. While
we lack experimental data to back up this claim, we firmly believe that a similar
increase in schedule freedom can be obtained in the aforementioned 3+1 ADRES
design by simply extending an existing ADRES interconnect with a similar amount
of additional resources. In the ADRES design, that extension would then also be
beneficial to operations other than multiplications.
The optimal number of ISs for a CGRA depends on the application domain, on
the reconfigurability, as well as on the IS functionality and on the DLP available
in the form of subword parallelism. As illustrated in Sect. 4.2.2, a typical ADRES
would consist of 4 × 4 ISs [10, 46]. TRIPS also features 4 × 4 ISs. MorphoSys
provides 8 × 8 ISs, but that is because the DLP is implemented as SIMD over
multiple ISs, rather than as subword parallelism within ISs. In our experience,
scaling dynamically reconfigurable CGRA architectures such as ADRES to very
large arrays (8 × 8 or larger) does not make sense even with scalable interconnects
like mesh or mesh-plus interconnects. Even in loops with high ILP, utilization drops
significantly on such large arrays [51]. It is not yet clear what causes this lower
utilization, and there may be several reasons. These include a lack of memory
bandwidth, the possibility that the compiler techniques [20, 48] simply do not scale
to such large arrays, or the fact that the relative connectivity in such large arrays is
lower. Simply stated, when a mesh interconnects all ISs to their neighbors, each IS
not on the edge of the array is connected to 4 other ISs out of 16 in a 4 × 4 array, i.e.,
to 25% of all ISs, while it is connected to 4 out of 64 ISs in an 8 × 8 array, i.e., to
6.25% of all ISs.
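The arithmetic behind this relative-connectivity argument can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes a plain mesh in which each IS is linked to its direct horizontal and vertical neighbors, and the function names are hypothetical, chosen for illustration.

```python
def mesh_neighbors(row, col, n):
    """Count the direct mesh neighbors (up, down, left, right) of one IS
    at position (row, col) in an n x n array."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return sum(1 for r, c in candidates if 0 <= r < n and 0 <= c < n)


def relative_connectivity(row, col, n):
    """Fraction of all ISs in the n x n array that this IS is directly linked to."""
    return mesh_neighbors(row, col, n) / (n * n)


# An IS that is not on the edge of the array, for both array sizes in the text.
for n in (4, 8):
    print(n, relative_connectivity(1, 1, n))
```

For an interior IS this prints 0.25 for the 4 × 4 array and 0.0625 for the 8 × 8 array, matching the 25% and 6.25% quoted above. ISs on the edge or in a corner have fewer neighbors, so their relative connectivity is lower still.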
To conclude this section, we want to mention that, just like in any other type of
processor, it makes sense to pipeline complex combinatorial logic, e.g., as found
in multipliers. There are no fundamental obstacles to doing so, and it can lead to
significant increases in utilization and clock frequency.
3.5 Memory Hierarchies
CGRAs have a large number of ISs that need to be fed with data from the memory.
Therefore the data memory sub-system is a crucial part of the CGRA design. Many
reconfigurable architectures feature multiple independent memory banks or blocks
to achieve high data bandwidth.
The RAW architecture features an independent memory block in each tile, for
which Barua developed a method called modulo unrolling to disambiguate and
assign data to different banks [4]. However, this technique can only handle array
references whose index expressions are affine in the loop induction variables.
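The core idea behind modulo unrolling can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the compiler algorithm of [4]: it assumes low-order interleaving (array element i resides in bank i mod B), a loop that starts at 0 with unit step, and a single affine reference A[stride * i + offset]. All function and parameter names are hypothetical.

```python
from math import gcd


def unroll_factor(stride, num_banks):
    """Smallest unroll factor after which each unrolled copy of the
    reference A[stride * i + offset] always hits the same bank
    (assumption: element k resides in bank k % num_banks)."""
    return num_banks // gcd(stride, num_banks)


def banks_per_unrolled_copy(stride, offset, num_banks, trip_count):
    """For each copy of the reference in the unrolled loop body, collect
    the set of banks it touches over the whole loop."""
    u = unroll_factor(stride, num_banks)
    banks = [set() for _ in range(u)]
    for i in range(trip_count):
        # Iteration i of the original loop becomes copy (i % u) after unrolling.
        banks[i % u].add((stride * i + offset) % num_banks)
    return banks


# A[i] over 4 banks: unrolling by 4 gives each copy exactly one bank.
print(banks_per_unrolled_copy(stride=1, offset=0, num_banks=4, trip_count=16))
# A[2*i + 1] over 4 banks: unrolling by 2 suffices.
print(banks_per_unrolled_copy(stride=2, offset=1, num_banks=4, trip_count=16))
```

Every set in the result contains a single bank number, so each reference in the unrolled body can be bound statically to one memory bank. The sketch also suggests why the restriction to affine index expressions matters: with a non-affine or data-dependent index, no fixed unroll factor would guarantee a single bank per reference.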
MorphoSys has a 256-bit wide frame buffer between the main memory and a
reconfigurable array to feed data to the ISs operating in SIMD mode [44]. The
efficient use of such a wide memory depends by and large on manual data placement