Fig. 3.8 Load-use conflict reduction by delayed execution: pipeline charts of the sequence MOV.L @R0, R1 (load) followed by ADD R1, R2 (ALU), comparing the conventional architecture (two-cycle stall) with delayed execution (one-cycle stall).
As illustrated in Fig. 3.8, a load instruction, MOV.L, executes in the E1 to E3 stages, and the load data is available at the end of the E3 stage. An ALU instruction, ADD, sets up the R1 and R2 values at the ID stage and adds the values at the E1 stage. Then the load data is forwarded from the E3 stage to the ID stage, and the pipeline stalls two cycles. With the delayed execution, the load instruction execution is the same, and the ADD instruction sets up the R1 and R2 values at the E1 stage and adds the values at the E2 stage. Then the load data is forwarded from the E3 stage to the E1 stage, and the pipeline stalls only one cycle, which is the same number of cycles as that of a five-stage pipeline like the SH-4.
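The stall counts in Fig. 3.8 follow directly from the stage positions. The following Python sketch is not part of the original text; it simply reproduces the two numbers under the simplifying assumptions that stages are numbered ID = 0, E1 = 1, E2 = 2, E3 = 3, that instructions nominally issue one cycle apart, and that the load data can be forwarded within the same cycle in which the E3 stage produces it.

# Hedged sketch: stall cycles for a load-use dependency, given the stage
# at which the load data becomes available and the stage at which the
# dependent ALU instruction sets up its operands.

def load_use_stall(data_ready_stage, operand_setup_stage):
    # Without stalls, the consumer reaches any stage one cycle after the
    # producer does; the extra cycles needed to line up with the
    # forwarding path show up as stalls.
    return max(0, data_ready_stage - operand_setup_stage - 1)

ID, E1, E3 = 0, 1, 3

# Conventional architecture: operands set up at ID -> two-cycle stall.
print(load_use_stall(E3, ID))  # 2

# Delayed execution: operands set up at E1 -> one-cycle stall.
print(load_use_stall(E3, E1))  # 1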
There was another option: starting the delayed execution at the E3 stage to avoid the load-use conflict stall entirely. However, the E3 stage was a poor point at which to define results. For example, if an ALU result were defined at E3 and an address calculation used that result at E1, a three-cycle issue distance would be required between the ALU instruction and the address calculation. On the other hand, programs for the SH-4 were already scheduled around the one-cycle stall. Therefore, the E2-start type was considered the better choice for the SH-X. In particular, we could expect a program optimized for the SH-4 to run properly on the SH-X.
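As a rough check of the issue-distance argument, the sketch below is purely illustrative: it assumes that a result defined at the end of stage P can feed an address calculation at stage C only when the two instructions issue at least P - C + 1 cycles apart. Under that assumption, the distances quoted above fall out directly.

# Hedged sketch: required issue distance between an instruction that
# defines a result at the end of `define_stage` and a later instruction
# whose address calculation reads the result at `use_stage`.

def required_issue_distance(define_stage, use_stage):
    return define_stage - use_stage + 1

E1, E2, E3 = 1, 2, 3

# E3-start alternative: result defined at E3, address calculation at E1
# -> a three-cycle issue distance, as noted in the text.
print(required_issue_distance(E3, E1))  # 3

# E2-start (the SH-X choice): result defined at E2 -> a two-cycle issue
# distance, i.e. the one-cycle stall that SH-4 programs already tolerate.
print(required_issue_distance(E2, E1))  # 2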
As illustrated in Fig. 3.7, a store instruction performs an address calculation, TLB and cache-tag accesses, a store-data latch, and a data store to the cache at the E1, E2, E4, and E5 stages, respectively, whereas a load instruction performs a cache access at the E2 stage. This means there is a three-stage gap in cache access timing between the E2 stage of a load and the E5 stage of a store. However, loads and stores use the same cache port. Therefore, a load instruction gets priority over a store instruction if their accesses conflict, and the store instruction must wait for a cycle with no conflict. In the case of an N-stage gap, N store-buffer entries are necessary to handle the worst case, a sequence of N consecutive store issues followed by N consecutive load issues, and the SH-X therefore implements three entries.
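A small event model helps show why three entries suffice. The Python sketch below is an assumed illustration, not the SH-X microarchitecture: it treats the En stage numbers from Fig. 3.7 as cycle offsets from issue, gives loads priority on the single cache port, and counts how many stores pile up in the buffer for the worst-case sequence named above.

# Hedged sketch: loads access the cache at E2 (issue + 2) and stores at E5
# (issue + 5), sharing one cache port with loads given priority, so a
# blocked store waits in the store buffer.

LOAD_ACCESS_OFFSET = 2   # E2
STORE_ACCESS_OFFSET = 5  # E5
STORE_BUFFER_DEPTH = 3   # the SH-X implements three entries

def peak_store_buffer(issue_sequence):
    """issue_sequence: one opcode ('L' or 'S') per issue cycle.
    Returns the maximum number of stores waiting for the cache port."""
    load_cycles = {i + LOAD_ACCESS_OFFSET
                   for i, op in enumerate(issue_sequence) if op == 'L'}
    store_ready = [i + STORE_ACCESS_OFFSET
                   for i, op in enumerate(issue_sequence) if op == 'S']
    waiting, peak, cycle = [], 0, 0
    while store_ready or waiting:
        waiting += [c for c in store_ready if c == cycle]
        store_ready = [c for c in store_ready if c > cycle]
        if cycle not in load_cycles and waiting:
            waiting.pop(0)               # port free: oldest store writes
        peak = max(peak, len(waiting))
        cycle += 1
    return peak

# Worst case from the text: three consecutive stores, then three loads.
print(peak_store_buffer(['S', 'S', 'S', 'L', 'L', 'L']))  # 3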
The flexible forwarding enables both an early register release and a late register allocation, and it eases program optimization. Figure 3.9 shows examples of both cases. In the early register release case, a floating-point addition instruction (FADD) generates a result at the end of the E4 stage, and a store instruction (FMOV) gets the result forwarded from the E5 stage of the FADD. Then the FR1 is released only one cycle after its allocation, although the FADD takes three cycles to generate the result. In the late register allocation case, an FADD forwards a result at the E6 stage, and a transfer instruction (FMOV) gets the forwarded result at the E1 stage. Then the FR2 allocation is five cycles after the FR1 allocation.
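The two cases can be restated as simple issue-slot arithmetic. The sketch below uses only the stage numbers quoted above; the assumption that a result forwarded at stage En of the producer is caught by a consumer stage Em executing in the same cycle, i.e. by a consumer issued n - m cycles later, is added purely for illustration.

# Hedged sketch: how many cycles after the producer a consumer can issue
# and still catch the result on the flexible forwarding path.

def issue_distance(producer_forward_stage, consumer_receive_stage):
    return producer_forward_stage - consumer_receive_stage

E1, E4, E5, E6 = 1, 4, 5, 6

# Early register release: FADD forwards from E5 and the store FMOV latches
# its store data at E4 -> the FMOV can follow in the next issue slot, so
# FR1 can be released one cycle after its allocation.
print(issue_distance(E5, E4))  # 1

# Late register allocation: FADD forwards from E6 and the transfer FMOV
# receives at E1 -> the FMOV, and hence the FR2 allocation, can trail the
# FR1 allocation by five cycles.
print(issue_distance(E6, E1))  # 5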