correctly this body flit driven by the stored grant signals at the corresponding output.
The tail flit in cycle 4 repeats the same procedure and moves to ST without waiting
for any SA grants. The head flit of the new incoming packet reaches the frontmost
position of the input buffer in cycle 5, initiating a new round of RC, SA, and ST
operations, just as the head flit of the previous packet did.
This stored-grants approach renders the previously presented “elementary” SA
pipeline obsolete. It reduces bubbles significantly with only minimal
delay overhead on the router's control path. It may also appear to simplify the
allocation procedure. In essence, however, it simply adds extra state registers to
the arbiter's path and a multiplexer to both the control and data paths. Therefore, this
approach is avoided in the single-cycle version, where the original request generation
logic is preferred. For the same reason, the stored grants will not be used in the next
pipelined SA configuration, which uses a pipeline register in both the control path and
the datapath, although they would have been a possible choice. The stored-grants approach
for the body and tail flits is a useful pipeline alternative when SA is separated from
ST solely in the control path.
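The extra state that the stored grants add at each output can be sketched as follows. This is a minimal illustrative model, not the router's actual logic: the class `OutputPort`, the method `step`, the fixed-priority choice among competing head flits, and the assumption that every packet has separate head and tail flits are all hypothetical. It captures only the register that remembers the winning input and its reuse by body and tail flits; the cycle timing of the SA pipeline is not modelled.

```python
class OutputPort:
    """Hypothetical model of one router output with a stored-grant register."""

    def __init__(self):
        # Index of the input that currently owns this output, or None if free.
        self.stored_grant = None

    def step(self, head_requests, owner_flit=None):
        """Return the input index selected for this output, or None.

        head_requests: input indices whose head flit requests this output.
        owner_flit: 'body' or 'tail' offered by the owning input, or None
                    if the owner has no flit available this cycle.
        """
        if self.stored_grant is not None:
            # Output is held by a packet: body/tail flits reuse the stored
            # grant and never take part in a new arbitration.
            if owner_flit is None:
                return None
            winner = self.stored_grant
            if owner_flit == "tail":
                self.stored_grant = None  # tail flit releases the output
            return winner
        if not head_requests:
            return None
        # Assumption: fixed-priority arbiter as a stand-in for the real one.
        winner = min(head_requests)
        self.stored_grant = winner        # remember the grant for later flits
        return winner


out = OutputPort()
trace = [
    out.step([2, 1]),        # head flits of inputs 1 and 2 compete
    out.step([2], "body"),   # body flit of the winner reuses the stored grant
    out.step([2], "tail"),   # tail flit reuses it and frees the output
    out.step([2]),           # input 2 can now win arbitration
]
print(trace)
```

The stored grant doubles as a lock on the output: while it is set, competing head flits are ignored, which is why the body and tail flits need no SA grants of their own.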
5.3.3 Idle-Cycle Free Operation of the SA Pipeline Stage
The dependencies that arise from delaying the delivery of the SA grants to the
crossbar and to the inputs of the router can alternatively be resolved by adding an
extra input pipeline register to the data path. The added data pipeline register, shown
in Fig. 5.13, does not have to be flow-controlled, since no flit will ever stall in this
position. This data pipeline register is used only to align the arrival of the registered
grant signals with the arrival of the corresponding flit at the input of the crossbar.
Since the grant signals should always be aligned with the corresponding data, the
grant signals delivered to the inputs should be taken before the grant pipeline
register (as done in Fig. 5.13). This is needed because the dequeued data will reach
the input of the crossbar one cycle later, after spending one cycle in the data
pipeline register. This extra cycle also requires an extra buffer slot at the output
buffer (at the input of the next router) for full-throughput operation, since the forward
latency L_f is increased by 1.
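The alignment of grants and data described above can be illustrated with a small cycle-level sketch. It is a simplified model under stated assumptions, not the router's implementation: a single input is modelled, every flit is already waiting in the input buffer, every SA request is granted at once, and credits are always available. The names `simulate`, `min_slots_full_throughput`, and the output label `"out0"` are hypothetical. The buffer-sizing helper additionally assumes the usual credit-based rule that the buffer depth must cover the round trip L_f + L_b, where L_b stands for the backward (credit) latency.

```python
from collections import deque


def simulate(flits, first_cycle=1):
    """Return {flit: {'SA': cycle, 'ST': cycle}} for one router input."""
    buffer = deque(flits)   # input buffer, frontmost flit on the left
    data_reg = None         # data pipeline register (never stalls)
    grant_reg = None        # registered grant travelling with data_reg
    schedule = {}
    cycle = first_cycle
    while buffer or data_reg is not None:
        # ST: the flit in the data register crosses the crossbar, steered
        # by the grant that was registered in the previous cycle.
        if data_reg is not None:
            assert grant_reg is not None  # grant and data arrive together
            schedule[data_reg]["ST"] = cycle
        # SA: the frontmost flit is granted (assumed) and dequeued in the
        # same cycle, using the grant taken before the grant register.
        if buffer:
            flit = buffer.popleft()
            schedule[flit] = {"SA": cycle}
            data_reg, grant_reg = flit, "out0"
        else:
            data_reg, grant_reg = None, None
        cycle += 1
    return schedule


def min_slots_full_throughput(forward_latency, backward_latency):
    """Assumed sizing rule: buffer slots must cover the credit round trip."""
    return forward_latency + backward_latency


timing = simulate(["head", "body", "tail"])
for name, stages in timing.items():
    print(name, stages)
```

In this sketch each flit performs SA in one cycle and ST in the next, and consecutive flits overlap so that no cycle is left idle. Under the assumed sizing rule, raising the forward latency by one raises the required number of buffer slots by one.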
The pipeline flow diagram that corresponds to the pipelined organization of
Fig. 5.13 is shown in Fig. 5.14. In this case, the head flit performs RC and SA in
cycle 1 and, after accepting the grants in the same cycle, it is dequeued and moves
to the data pipeline register, having consumed the necessary credit. In cycle 2,
the head flit leaves the data pipeline register and moves to the selected output using
the grants produced by the corresponding output arbiter in the previous cycle. In
the same cycle, the following body flit, which arrived in cycle 1 and now occupies
the frontmost position of the input buffer, can perform SA and, once granted,
can also move to the data pipeline register, consuming in parallel the necessary
downstream credit. The same holds for the following tail flit, which can perform all
needed operations without experiencing any idle cycles. In cycle 4, the full overlap