Fig. 3.8 Load-use conflict reduction by delayed execution: pipeline charts of the sequence MOV.L @R0, R1 (load) followed by ADD R1, R2 (ALU), comparing the conventional architecture (two-cycle stall) with delayed execution (one-cycle stall).
As illustrated in Fig. 3.8, a load instruction, MOV.L, executes in the E1 to E3 stages, and the load data is available at the end of the E3 stage. An ALU instruction, ADD, sets up the R1 and R2 values at the ID stage and adds the values at the E1 stage. Then the load data is forwarded from the E3 stage to the ID stage, and the pipeline stalls two cycles. With the delayed execution, the load instruction execution is the same, and the ADD instruction sets up the R1 and R2 values at the E1 stage and adds the values at the E2 stage. Then the load data is forwarded from the E3 stage to the E1 stage, and the pipeline stalls only one cycle, which is the same number of cycles as that of a five-stage pipeline like the SH-4.
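The stall counts in Fig. 3.8 follow directly from the stage positions. The following Python sketch is not part of the original text; it simply reproduces the two numbers under the simplifying assumptions that stages are numbered ID = 0, E1 = 1, E2 = 2, E3 = 3, that instructions nominally issue one cycle apart, and that the load data can be forwarded within the same cycle in which the E3 stage produces it.

# Hedged sketch: stall cycles for a load-use dependency, given the stage
# at which the load data becomes available and the stage at which the
# dependent ALU instruction sets up its operands.

def load_use_stall(data_ready_stage, operand_setup_stage):
    # Without stalls, the consumer reaches any stage one cycle after the
    # producer does; the extra cycles needed to line up with the
    # forwarding path show up as stalls.
    return max(0, data_ready_stage - operand_setup_stage - 1)

ID, E1, E3 = 0, 1, 3

# Conventional architecture: operands set up at ID -> two-cycle stall.
print(load_use_stall(E3, ID))  # 2

# Delayed execution: operands set up at E1 -> one-cycle stall.
print(load_use_stall(E3, E1))  # 1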
There was another option: starting the delayed execution at the E3 stage to avoid the load-use conflict stall entirely. However, the E3 stage was a poor point at which to define results. For example, if an ALU result were defined at E3 and an address calculation used that result at E1, a three-cycle issue distance would be required between the ALU instruction and the address calculation. On the other hand, programs for the SH-4 were already scheduled around the one-cycle stall. Therefore, the E2-start type was considered the better choice for the SH-X. In particular, we could expect a program optimized for the SH-4 to run properly on the SH-X.
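As a rough check of the issue-distance argument, the sketch below is purely illustrative: it assumes that a result defined at the end of stage P can feed an address calculation at stage C only when the two instructions issue at least P - C + 1 cycles apart. Under that assumption, the distances quoted above fall out directly.

# Hedged sketch: required issue distance between an instruction that
# defines a result at the end of `define_stage` and a later instruction
# whose address calculation reads the result at `use_stage`.

def required_issue_distance(define_stage, use_stage):
    return define_stage - use_stage + 1

E1, E2, E3 = 1, 2, 3

# E3-start alternative: result defined at E3, address calculation at E1
# -> a three-cycle issue distance, as noted in the text.
print(required_issue_distance(E3, E1))  # 3

# E2-start (the SH-X choice): result defined at E2 -> a two-cycle issue
# distance, i.e. the one-cycle stall that SH-4 programs already tolerate.
print(required_issue_distance(E2, E1))  # 2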
As illustrated in Fig. 3.7, a store instruction performs an address calculation, TLB and cache-tag accesses, a store-data latch, and a data store to the cache at the E1, E2, E4, and E5 stages, respectively, whereas a load instruction performs a cache access at the E2 stage. This means there is a three-stage gap in cache access timing between the E2 stage of a load and the E5 stage of a store. However, loads and stores use the same cache port. Therefore, a load instruction gets priority over a store instruction if their accesses conflict, and the store instruction must wait for a cycle with no conflict. In the case of an N-stage gap, N store-buffer entries are necessary to handle the worst case, a sequence of N consecutive store issues followed by N consecutive load issues, and the SH-X therefore implements three entries.
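A small event model helps show why three entries suffice. The Python sketch below is an assumed illustration, not the SH-X microarchitecture: it treats the En stage numbers from Fig. 3.7 as cycle offsets from issue, gives loads priority on the single cache port, and counts how many stores pile up in the buffer for the worst-case sequence named above.

# Hedged sketch: loads access the cache at E2 (issue + 2) and stores at E5
# (issue + 5), sharing one cache port with loads given priority, so a
# blocked store waits in the store buffer.

LOAD_ACCESS_OFFSET = 2   # E2
STORE_ACCESS_OFFSET = 5  # E5
STORE_BUFFER_DEPTH = 3   # the SH-X implements three entries

def peak_store_buffer(issue_sequence):
    """issue_sequence: one opcode ('L' or 'S') per issue cycle.
    Returns the maximum number of stores waiting for the cache port."""
    load_cycles = {i + LOAD_ACCESS_OFFSET
                   for i, op in enumerate(issue_sequence) if op == 'L'}
    store_ready = [i + STORE_ACCESS_OFFSET
                   for i, op in enumerate(issue_sequence) if op == 'S']
    waiting, peak, cycle = [], 0, 0
    while store_ready or waiting:
        waiting += [c for c in store_ready if c == cycle]
        store_ready = [c for c in store_ready if c > cycle]
        if cycle not in load_cycles and waiting:
            waiting.pop(0)               # port free: oldest store writes
        peak = max(peak, len(waiting))
        cycle += 1
    return peak

# Worst case from the text: three consecutive stores, then three loads.
print(peak_store_buffer(['S', 'S', 'S', 'L', 'L', 'L']))  # 3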
The flexible forwarding enables both an early register release and a late register allocation, and it eases program optimization. Figure 3.9 shows examples of both cases. In the early register release case, a floating-point addition instruction (FADD) generates a result at the end of the E4 stage, and a store instruction (FMOV) gets the result forwarded from the E5 stage of the FADD. Then the FR1 is released only one cycle after its allocation, although the FADD takes three cycles to generate the result. In the late register allocation case, an FADD forwards a result at the E6 stage, and a transfer instruction (FMOV) gets the forwarded result at the E1 stage. Then the FR2 allocation is five cycles after the FR1 allocation.
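The two cases can be restated as simple issue-slot arithmetic. The sketch below uses only the stage numbers quoted above; the assumption that a result forwarded at stage En of the producer is caught by a consumer stage Em executing in the same cycle, i.e. by a consumer issued n - m cycles later, is added purely for illustration.

# Hedged sketch: how many cycles after the producer a consumer can issue
# and still catch the result on the flexible forwarding path.

def issue_distance(producer_forward_stage, consumer_receive_stage):
    return producer_forward_stage - consumer_receive_stage

E1, E4, E5, E6 = 1, 4, 5, 6

# Early register release: FADD forwards from E5 and the store FMOV latches
# its store data at E4 -> the FMOV can follow in the next issue slot, so
# FR1 can be released one cycle after its allocation.
print(issue_distance(E5, E4))  # 1

# Late register allocation: FADD forwards from E6 and the transfer FMOV
# receives at E1 -> the FMOV, and hence the FR2 allocation, can trail the
# FR1 allocation by five cycles.
print(issue_distance(E6, E1))  # 5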