Coarse-Grained Reconfigurable Array Architectures - Signal Processing Systems

whereas this scheduling freedom would be limited to one IS slot in the ADRES

design. To allow this schedule freedom, however, a significant amount of resources

in the form of switches and a special-purpose bus need to be added to the row. While

we lack experimental data to back up this claim, we firmly believe that a similar

increase in schedule freedom can be obtained in the aforementioned 3+1 ADRES

design by simply extending an existing ADRES interconnect with a similar amount

of additional resources. In the ADRES design, that extension would then also be

beneficial to operations other than multiplications.

The optimal number of ISs for a CGRA depends on the application domain, on

the reconfigurability, as well as on the IS functionality and on the DLP available

in the form of subword parallelism. As illustrated in Sect. 4.2.2, a typical ADRES

would consist of 4 × 4 ISs [10, 46]. TRIPS also features 4 × 4 ISs. MorphoSys

provides 8 × 8 ISs, but that is because the DLP is implemented as SIMD over

multiple ISs, rather than as subword parallelism within ISs. In our experience,

scaling dynamically reconfigurable CGRA architectures such as ADRES to very

large arrays (8 × 8 or larger) does not make sense even with scalable interconnects

like mesh or mesh-plus interconnects. Even in loops with high ILP, utilization drops

significantly on such large arrays [51]. It is not yet clear what is causing this lower

utilization, and there might be several reasons. These include a lack of memory

bandwidth, the possibility that the compiler techniques [20, 48] simply do not scale

to such large arrays, or the fact that the relative connectivity in such large arrays is

lower. Simply stated, when a mesh interconnects all ISs to their neighbors, each IS

not on the side of the array is connected to 4 other ISs out of 16 in a 4 × 4 array, i.e.,

to 25% of all ISs, while it is connected to 4 out of 64 ISs on an 8 × 8 array, i.e., to

6.25% of all ISs.
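
The arithmetic behind this relative-connectivity argument can be reproduced with a minimal sketch. The function name interior_connectivity is hypothetical, and the sketch assumes a plain mesh in which every interior IS has exactly four neighbors and, as in the text, the fraction is taken over all ISs in the array.

```python
def interior_connectivity(rows: int, cols: int) -> float:
    """Fraction of all ISs that an interior IS is directly connected to.

    Assumes a plain mesh (north, south, east, west links only), so an IS
    that is not on the side of the array always has four neighbors.
    """
    if rows < 3 or cols < 3:
        raise ValueError("an interior IS requires at least a 3 x 3 array")
    neighbors = 4  # north, south, east, west
    return neighbors / (rows * cols)


if __name__ == "__main__":
    # Relative connectivity shrinks quadratically with the array dimension.
    for n in (4, 8):
        print(f"{n} x {n}: {interior_connectivity(n, n):.2%}")
```

Because the neighbor count stays fixed while the number of ISs grows with the square of the array dimension, doubling the dimension divides the relative connectivity by four.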

To conclude this section, we want to mention that, just like in any other type of

processor, it makes sense to pipeline complex combinational logic, e.g., as found

in multipliers. There are no fundamental obstacles to doing this, and it can lead to

significant increases in utilization and clock frequency.

3.5 Memory Hierarchies

CGRAs have a large number of ISs that need to be fed with data from the memory.

Therefore the data memory sub-system is a crucial part of the CGRA design. Many

reconfigurable architectures feature multiple independent memory banks or blocks

to achieve high data bandwidth.

The RAW architecture features an independent memory block in each tile for

which Barua developed a method called modulo unrolling to disambiguate and

assign data to different banks [4]. However, this technique can only handle array

references whose index expressions are affine functions of loop induction variables.
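
To illustrate why affine index expressions matter here, the following is a simplified sketch of the idea behind modulo unrolling, not the full method of [4]. It assumes, hypothetically, that array elements are interleaved across banks so that element k resides in bank k mod num_banks, and that the reference has the form A[stride * i + offset]. All function names are illustrative only.

```python
from math import gcd


def unroll_factor(stride: int, num_banks: int) -> int:
    """Smallest unroll factor after which each copy of the reference
    A[stride * i + offset] always touches the same bank."""
    return num_banks // gcd(stride, num_banks)


def banks_after_unrolling(stride: int, offset: int, num_banks: int) -> list[int]:
    """Bank accessed by each of the unrolled copies of the reference."""
    factor = unroll_factor(stride, num_banks)
    return [(stride * k + offset) % num_banks for k in range(factor)]


def check(stride: int, offset: int, num_banks: int, iterations: int = 64) -> bool:
    """Verify on the original loop that copy (i mod factor) hits a fixed bank."""
    factor = unroll_factor(stride, num_banks)
    expected = banks_after_unrolling(stride, offset, num_banks)
    for i in range(iterations):
        if (stride * i + offset) % num_banks != expected[i % factor]:
            return False
    return True


if __name__ == "__main__":
    # A[2 * i + 1] on four banks: unrolling by two pins each copy to one bank.
    print(unroll_factor(2, 4), banks_after_unrolling(2, 1, 4), check(2, 1, 4))
```

The sketch works because stride * factor is a multiple of num_banks, so the bank of each unrolled copy is known at compile time. For a non-affine index such as A[B[i]], no such static bank assignment can be derived, which is the limitation noted above.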

MorphoSys has a 256-bit wide frame buffer between the main memory and a

reconfigurable array to feed data to the ISs operating in SIMD mode [44]. The

efficient use of such a wide memory depends by and large on manual data placement
