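Fig. 2.4 Parallel operation
[Timing chart: processing periods (P) and DTU data transfer periods (T) of each
core plotted against time. CPU #0: P1, P8, W1, P11 and T2. CPU #1: P2, P5, P9
and T1, T5, T8. SPP a #0: P3, P7, W2 and T4, T7. SPP b #0: P4, P6, P10 and
T3, T6.]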
When a program is executed on a heterogeneous multicore, it is divided into
small parts, and each part is executed in parallel on the most suitable processor
core, as shown in Fig. 2.4. Each core processes data in its LM or cache during a
processing period Pi, and the DTU of a core simultaneously executes a
memory-to-memory data transfer during a transfer period Ti. For example, CPU #1
processes data in its LM during period P2, and its DTU then transfers the
processed data from the LM of CPU #1 to the LM of SPP b #0 during period T1.
After the data transfer, SPP b #0 starts to process the data in its LM during
period P6. Meanwhile, CPU #1 starts process P5, which overlaps with period T1.
The parallel operation of Fig. 2.4 also contains time slots such as W1, during
which the corresponding core, CPU #0, needs neither to process data nor to
transfer it. During this time slot, the clocks of the PU and DTU of CPU #0 can be
slowed down or stopped, or their power supplies can be cut off, under the control
of the connected FVC. Likewise, as SPP a #0 performs no internal operations
during time slot W2, its power can be cut off during this slot. This FVC control
reduces unnecessary power consumption in idle cores and can thereby lower the
power consumption of the heterogeneous multicore chip.
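The idle-slot reasoning above can be sketched in a few lines of Python. This is only an illustration, not the chip's actual control logic: the interval start and end times, the power values, and the names `Interval`, `idle_slots`, and `energy` are all hypothetical, since Fig. 2.4 gives no numeric time axis. A core is treated as idle when neither its PU (a P period) nor its DTU (a T period) is busy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    label: str  # e.g. "P1" (processing) or "T2" (DTU transfer)
    start: int  # hypothetical time units, not taken from the figure
    end: int


def idle_slots(busy, horizon):
    """Return (start, end) gaps in [0, horizon) not covered by any busy interval."""
    gaps = []
    cursor = 0
    for iv in sorted(busy, key=lambda iv: iv.start):
        if iv.start > cursor:
            gaps.append((cursor, iv.start))
        cursor = max(cursor, iv.end)
    if cursor < horizon:
        gaps.append((cursor, horizon))
    return gaps


def energy(busy, horizon, p_active, p_idle, p_off):
    """Return (energy without gating, energy with gating) in arbitrary units."""
    idle = sum(end - start for start, end in idle_slots(busy, horizon))
    active = horizon - idle
    return active * p_active + idle * p_idle, active * p_active + idle * p_off


# Assumed schedule for CPU #0: the gap between P8 and P11 plays the role of W1.
cpu0 = [
    Interval("P1", 0, 3),
    Interval("T2", 2, 5),   # transfer overlaps the end of P1
    Interval("P8", 5, 8),
    Interval("P11", 11, 14),
]

print(idle_slots(cpu0, 14))         # [(8, 11)]
print(energy(cpu0, 14, 10, 4, 0))   # (122, 110)
```

With these assumed numbers, cutting power during the single idle slot lowers the core's energy from 122 to 110 units. The saving grows with the length of the slot and with the idle power that would otherwise be drawn.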
Here, we show an example of our architecture model applied to a heterogeneous
multicore chip. Figure 2.5 is a photograph of the RP-X chip (see Sect. 4.4) [3-5],
and Figure 2.6 depicts its internal block diagram. The chip includes eight CPU
cores and seven SPP cores of three types. Each CPU (see Sect. 3.1) includes a
two-level LM as well as a 32-KB instruction cache and a 32-KB operand cache. The
LM consists of a 16-KB ILRAM for instruction storage, a 16-KB OLRAM for data
storage, and a 64-KB URAM for instruction and data storage. Each CPU has a local
clock pulse generator (LCPG) that corresponds to the FVC and controls the CPU's
clock frequency independently. The eight CPUs are divided into two clusters of
four, and each cluster is connected to an independent on-chip bus. Additionally,
each cluster has a 256-KB CSM and a DDR3 port that is connected to off-chip
DDR3 DRAMs.
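The memory sizes given above can be tallied with a small data model. The following Python sketch is an illustration only: the class and field names (`Cpu`, `Cluster`, `Chip`, `lm_bytes`, and so on) are hypothetical, and only the sizes and counts come from the text.

```python
from dataclasses import dataclass, field

KB = 1024


@dataclass(frozen=True)
class Cpu:
    icache: int = 32 * KB  # instruction cache
    ocache: int = 32 * KB  # operand cache
    ilram: int = 16 * KB   # LM: instruction storage
    olram: int = 16 * KB   # LM: data storage
    uram: int = 64 * KB    # LM: instruction and data storage

    @property
    def lm_bytes(self) -> int:
        """Total local memory (ILRAM + OLRAM + URAM) of one CPU."""
        return self.ilram + self.olram + self.uram


@dataclass(frozen=True)
class Cluster:
    # Four CPUs per cluster, each with its own LCPG (not modeled here).
    cpus: tuple = field(default_factory=lambda: tuple(Cpu() for _ in range(4)))
    csm: int = 256 * KB


@dataclass(frozen=True)
class Chip:
    clusters: tuple = field(default_factory=lambda: (Cluster(), Cluster()))
    spp_cores: int = 7  # seven SPP cores of three types (types not modeled)

    @property
    def cpu_count(self) -> int:
        return sum(len(cluster.cpus) for cluster in self.clusters)

    @property
    def total_lm_bytes(self) -> int:
        return sum(cpu.lm_bytes for cluster in self.clusters for cpu in cluster.cpus)

    @property
    def total_csm_bytes(self) -> int:
        return sum(cluster.csm for cluster in self.clusters)


chip = Chip()
print(chip.cpu_count)              # 8
print(Cpu().lm_bytes // KB)        # 96
print(chip.total_lm_bytes // KB)   # 768
print(chip.total_csm_bytes // KB)  # 512
```

Under this model, each CPU has 96 KB of LM, so the eight CPUs hold 768 KB of LM in total, and the two clusters together provide 512 KB of CSM.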
 