the packet can be sent. Once the buffer for the entire packet has been allocated,
the channel resource can be allocated on flit granularity. A multi-flit packet can be
interrupted during transmission from input to output; the packet will not necessarily
be sent continuously. However, when the head flit arrives at an input, it reserves the
next L slots in the output buffer so that the whole packet can be kept at the output
in case of a downstream stall.
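
The reservation step can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class `OutputBuffer`, its methods, and the `reserved` counter are hypothetical names introduced here, and the head flit is assumed to occupy the first of the L slots it reserves.

```python
from collections import deque

class OutputBuffer:
    """Toy output buffer that reserves space for a whole packet (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.flits = deque()
        self.reserved = 0  # slots promised to the packet in flight, not yet filled

    def free_slots(self):
        return self.capacity - len(self.flits) - self.reserved

    def accept_head(self, packet_len):
        """Head flit arrives: admit it only if all L = packet_len slots are free."""
        if self.free_slots() < packet_len:
            return False
        self.flits.append("H")
        self.reserved += packet_len - 1  # remaining L-1 slots stay reserved
        return True

    def accept_rest(self, flit):
        """A body or tail flit fills one of the slots reserved by its head."""
        assert self.reserved > 0, "no reservation to consume"
        self.reserved -= 1
        self.flits.append(flit)

    def drain(self):
        """Downstream accepts one flit (returns None if the buffer is empty)."""
        return self.flits.popleft() if self.flits else None
```

Because space is reserved up front, body and tail flits may arrive in any later cycle without risking overflow, which is what allows the transmission of a packet to be interrupted.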
Even with flit-level flow control, buffers can be allocated at the packet level
by employing atomic buffer allocation. In this case, the head flit of a packet is never
buffered behind the tail flit of another packet in the same buffer. In effect, buffers
are implicitly allocated on packet granularity, even though flit-based flow control is
used. This behavior can be achieved by releasing the outAvailable flag not when the
tail flit arrives at the output buffer but when it leaves the output buffer. In this way,
when the next head flit arrives, it finds the output buffer empty. Whenever buffers are
allocated at the packet level, the buffer space required is equal to the size of the
longest packet, which inevitably leads to low buffer utilization for short packets.
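
The difference between the two release points can be sketched as follows. The model is illustrative only: `OutputPort`, the `atomic` switch, and the method names are assumptions made for this example, and only the timing of the outAvailable release is taken from the text.

```python
from collections import deque

class OutputPort:
    """Output buffer with an outAvailable flag (hypothetical behavioral model)."""

    def __init__(self, atomic):
        self.atomic = atomic        # True: release when the tail leaves the buffer
        self.queue = deque()
        self.out_available = True   # no packet currently owns the output

    def enqueue(self, flit, is_head=False, is_tail=False):
        if is_head:
            assert self.out_available, "output is owned by another packet"
            self.out_available = False
        self.queue.append((flit, is_tail))
        if is_tail and not self.atomic:
            self.out_available = True   # non-atomic: release on tail arrival

    def dequeue(self):
        flit, is_tail = self.queue.popleft()
        if is_tail and self.atomic:
            self.out_available = True   # atomic: release only when the tail leaves
        return flit
```

With `atomic=True`, a new head flit can be admitted only after the previous tail has drained, so flits of two packets never share the buffer. With `atomic=False`, the next head may queue directly behind the previous tail.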
In the following, we adopt non-atomic buffer allocation. However, whenever atomic
buffer allocation is needed, the aforementioned rules can be applied to enforce it.
3.1.3 Hierarchical Switching
Arbitration and multiplexing for reaching the output link can be performed
hierarchically by merging a group of inputs at each step and allowing one flit from
them to progress toward the output. An example of a hierarchical 1-output switch
organization is depicted in Fig. 3.5a. The main difference between hierarchical and
single-step switching is that, at each step, a 2-input arbiter and a 2-to-1 multiplexer
are enough to switch the flits between two inputs, whereas in the single-step case the
arbiter and the multiplexer must have as many inputs as the whole switching module.
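
To give a feel for the structure, the sketch below counts the 2-input merging points per step when inputs are merged pairwise. The pairwise (binary tree) arrangement and the helper name `tree_shape` are assumptions made for illustration; the text only requires that a group of inputs is merged at each step.

```python
def tree_shape(num_inputs):
    """Return the number of 2-input merging points at each step of a pairwise merge."""
    levels = []
    width = num_inputs
    while width > 1:
        merges = width // 2            # pairs merged at this step
        levels.append(merges)
        width = merges + width % 2     # an odd leftover input passes to the next step
    return levels

shape = tree_shape(8)
print(shape, sum(shape))   # steps of 4, 2 and 1 merging points, 7 in total
```

Each merging point removes exactly one competitor, so N inputs always need N-1 two-input arbiter and multiplexer pairs, arranged in roughly log2(N) steps for a balanced tree.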
To achieve maximum flexibility and increase the throughput of the system by
allowing multiple packets to move in parallel closer to the output, we should also
modify the request generation logic of the baseline design. In the baseline case, every
input, before issuing a request to the arbiter, qualified its valid signal with the
outAvailable flag of the output and then masked the result with the ready signal
of the output buffer (see Fig. 3.2). In the hierarchical implementation this is not
possible, since there is no global arbiter to check the requests of all inputs. Instead,
we assume that each merging point can be considered a partial output that has its
own outAvailable flag. In this way, at each merging point, we can reuse unchanged the
allocation and multiplexing logic designed for the baseline case (Fig. 3.2), including
the outLock variable at the input of each merging step.
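
The per-merging-point request logic can be sketched as below. This is a minimal model under assumptions, not the circuit of Fig. 3.2: the names `MergePoint`, `InputPort`, `make_request`, and `on_grant` are hypothetical, an input that already holds outLock is assumed to bypass the outAvailable check, and the flag is released on the tail grant in line with non-atomic allocation.

```python
from dataclasses import dataclass

@dataclass
class MergePoint:
    """Partial output: one merging step with its own flags (hypothetical model)."""
    out_available: bool = True   # no packet currently owns this merging point
    ready: bool = True           # the buffer after this merging point has space

@dataclass
class InputPort:
    out_lock: bool = False       # this input currently owns the merging point

def make_request(valid, port, point):
    """Qualify valid with outAvailable (or an existing lock), then mask with ready."""
    qualified = valid and (port.out_lock or point.out_available)
    return qualified and point.ready

def on_grant(is_head, is_tail, port, point):
    """Update the flags after a granted flit crosses the merging point."""
    if is_head:                  # the head flit acquires the partial output
        port.out_lock = True
        point.out_available = False
    if is_tail:                  # the tail flit gives it back (non-atomic release)
        port.out_lock = False
        point.out_available = True
```

Since every merging point carries its own flags, the same two functions can be instantiated at each step of the hierarchy, and packets heading to the same output can advance in different steps at the same time.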