[Fig. 7.14 block diagram: the outVCAvailable flags (V per input VC) are masked with the reqVC[i] request bits (NxV requests); N parallel V:1 arbiters produce selOutVC[i], for a total of N V:1 arbiters.]
Fig. 7.14 An alternative organization of the VA1 stage of the VC allocator that offers delay benefits at a small area overhead. It replaces a mux, one arbiter, and a demux with N arbiters that run in parallel and prepare the output VC requests of each input VC in a form that directly matches the connections of the arbiters in the VA2 stage
available output VCs, and then in VA2 each output VC selects at most one input VC. The input VCs are informed by the arbiters of VA2 whether their request was finally accepted.
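The two-stage protocol above can be sketched as a small behavioral model. This is an illustrative sketch, not the book's RTL: the function and variable names (`vc_allocate`, `out_port`, `out_vc_available`) are assumptions, and both stages use simple fixed-priority arbiters for brevity.

```python
def fixed_priority_arb(requests):
    """Grant the lowest-indexed asserted request (one-hot grant vector)."""
    grant = [0] * len(requests)
    for i, r in enumerate(requests):
        if r:
            grant[i] = 1
            break
    return grant

def vc_allocate(out_port, out_vc_available, N, V):
    """Behavioral sketch of two-stage VC allocation.
    out_port[i]: destined output port of input VC i (None if idle).
    out_vc_available[p][v]: availability flag of VC v at output port p.
    Returns, per input VC, the granted (port, vc) pair or None."""
    # VA1: each input VC arbitrates among the available VCs
    # of its destined output port.
    va1_choice = []
    for i in range(N * V):
        p = out_port[i]
        if p is None:
            va1_choice.append(None)
            continue
        g = fixed_priority_arb(out_vc_available[p])
        va1_choice.append((p, g.index(1)) if 1 in g else None)
    # VA2: each output VC grants at most one input VC
    # (lowest-indexed requester wins in this sketch).
    grant = [None] * (N * V)
    claimed = set()
    for i, choice in enumerate(va1_choice):
        if choice is not None and choice not in claimed:
            grant[i] = choice
            claimed.add(choice)
    return grant
```

For example, with N = 2 ports and V = 2 VCs, if input VCs 0 and 1 both target port 1, only one of them wins its chosen output VC in VA2; the loser is informed by receiving no grant.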
Faster Organization of the VA1 Stage
Implementation results prove that the (de)multiplexing logic at VA1 has a non-trivial contribution to the overall delay of VC allocation. A simple microarchitectural change can completely eliminate this logic and significantly speed up VC allocation. The new fast organization of VA1 is shown in Fig. 7.14.
First, the output VC availability flags of all outputs are masked with the reqVC vector of each input VC, without any pre-selection step. The resulting availability vectors, one per output, are independently arbitrated by V:1 arbiters, each selecting one available VC for its output. Of the selected output VCs (one available VC per output), each input VC needs only one: the one that belongs to its destined output port. Selecting it does not require any multiplexing, just an additional masking operation with the output port request (outPort[i]) of the i-th input VC. After this masking, the selected output VCs of all outputs become zero except the one that matches the destined output port. Therefore, after this last step, the output VC request of an input VC is ready and aligned per output, exactly as needed by the output VC arbiters of the second stage. Thus, no additional demultiplexing/alignment logic is needed and significant delay is saved. The cost of this method is that it replaces a mux (the outVCAvailable multiplexer of Fig. 7.13), one arbiter, and a demux (Fig. 7.13) with N arbiters that run in parallel and offer a faster implementation.
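The three steps of this fast VA1 organization (mask with reqVC, N parallel V:1 arbitrations, final mask with outPort[i]) can be sketched for one input VC as follows. This is a behavioral sketch under assumed naming, not the book's implementation; bit k = p*V + v of each flat vector is taken to correspond to VC v of output port p.

```python
def fast_va1(req_vc, out_port_onehot, out_vc_available, N, V):
    """Fast VA1 stage of Fig. 7.14 for one input VC (behavioral sketch).
    req_vc: N*V-bit request mask of the input VC, grouped per output.
    out_port_onehot: N-bit one-hot destined-output-port request.
    out_vc_available: flat N*V-bit availability vector of all outputs.
    Returns the N*V-bit output VC request, already aligned per output."""
    # Step 1: mask all availability flags with the input VC's requests;
    # no pre-selection of the destined output is performed.
    masked = [req_vc[k] & out_vc_available[k] for k in range(N * V)]
    # Step 2: N parallel V:1 fixed-priority arbiters, one per output,
    # each selecting one available VC for its output.
    sel = [0] * (N * V)
    for p in range(N):
        for v in range(V):
            if masked[p * V + v]:
                sel[p * V + v] = 1  # first available VC of output p wins
                break
    # Step 3: AND with outPort[i]; only the winner of the destined
    # output survives, so no demultiplexing/alignment logic is needed.
    return [sel[p * V + v] & out_port_onehot[p]
            for p in range(N) for v in range(V)]
```

With N = 2 and V = 2, an input VC requesting any VC of output port 1 when only VC 1 of that port is available ends up with the single bit for (port 1, VC 1) asserted, already positioned where the VA2 arbiter of that output VC expects it.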
Please notice also that, since the outPort[i] request bits are used only after the V:1 arbitration step, routing computation can be overlapped in time with the
 