Digital Signal Processing Reference
architecture. This study serves two purposes. First, it illustrates the extent to which
source code needs to be tuned to map well onto CGRA architectures. As we will
show, this is an important aspect of using CGRAs, even when good compiler
support is available and when a very flexible CGRA is targeted, i.e., one that puts
very few restrictions on the loop bodies that it can accelerate. Second, our use
case illustrates how Design Space Exploration is necessary to instantiate optimized
designs from parameterizable and customizable architecture templates such as the
ADRES architecture template. Some conclusions are drawn in Sect. 5.
2 CGRA Basics
CGRAs focus on the efficient execution of the type of loops discussed in the
previous section. By neglecting non-loop code or outer-loop code that is assumed
to be executed on other cores, CGRAs can take the VLIW principles for exploiting
ILP in loops a step further to consume less energy and deliver higher performance,
without compromising on available compiler support. Figures 1 and 2 illustrate this.
Higher performance for high-ILP loops is obtained through two main features
that separate CGRA architectures from VLIW architectures. First, CGRA
architectures typically provide more Issue Slots (ISs) than typical VLIWs do.
In the CGRA literature, other commonly used terms for CGRA
ISs are Arithmetic-Logic Units (ALUs), Functional Units (FUs), or Processing
Elements (PEs). Conceptually, these terms all denote the same thing: logic on which
an instruction can be executed, typically one per cycle. For example, a typical
ADRES [9-11, 20, 46, 48-50] CGRA consists of 16 issue slots, whereas the TI
C64 features 8 slots, and the NXP TriMedia features only 5 slots. The higher
number of ISs directly enables higher IPCs, and hence higher performance,
as indicated by Eq. (1). To support these higher IPCs, the bandwidth to memory
is increased by providing more load/store ISs than on a typical VLIW, and by adding
special memory hierarchies as found on ASIPs, ASICs, and other DSPs, such as
FIFOs, stream buffers, and scratch-pad memories. Second, CGRA architectures
typically provide a number of direct connections between the ISs that allow data to
“flow” from one IS to another without needing to pass data through a Register File (RF).
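A back-of-the-envelope sketch (assumptions mine, not an equation from the chapter) shows why the issue-slot counts quoted above matter: if a loop body contains a fixed number of operations and scheduling were perfect, the achievable cycle count per iteration is bounded below by the operation count divided by the number of issue slots.

```c
/* Idealized minimum cycles to execute `ops` operations on a machine
 * with `slots` issue slots, assuming perfect scheduling (achieved
 * IPC == slots). Real IPC is limited by dependences and resource
 * conflicts, so this is only an upper bound on the benefit. */
long min_cycles(long ops, long slots) {
    return (ops + slots - 1) / slots;  /* ceiling division */
}
```

For a loop body of 64 operations, this idealized bound gives 4 cycles on a 16-slot ADRES instance, 8 on the 8-slot TI C64, and 13 on the 5-slot NXP TriMedia, which is the scaling the text alludes to.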
Fig. 1 An example clustered VLIW architecture with two RFs (RF 0, RF 1) and
eight ISs (IS 0 through IS 7). Solid directed edges denote physical connections.
Black and white small boxes denote input and output ports, respectively. There is
a one-to-one mapping between input and output ports and physical connections