Hardware Metrics
From a hardware perspective, the performance metrics of interest are the CPU
and I/O speeds and the performance of the interconnection network. The CPU and
I/O speeds are the same as in the uniprocessor case, so the key parameters of interest
in a parallel system are those associated with the interconnect. There are two
key items: latency and bandwidth, which we will now look at in turn.
The roundtrip latency is the time it takes for a CPU to send a packet and get a
reply. If the packet is sent to a memory, then the latency measures the time to read
or write a word or block of words. If it is sent to another CPU, it measures the
interprocessor communication time for packets of that size. Usually, the latency of
interest is for minimal packets, often one word or a small cache line.
The latency is built up from several factors and is different for circuit-switched,
store-and-forward, virtual cut-through, and wormhole-routed interconnects. For
circuit switching, the latency is the sum of the setup time and the transmission
time. To set up a circuit, a probe packet has to be sent out to reserve the resources
and then report back. Once that has happened, the data packet has to be assembled.
When it is ready, bits can flow at full speed, so if the total setup time is $T_s$,
the packet size is $p$ bits, and the bandwidth is $b$ bits/sec, the one-way latency is
$T_s + p/b$. If the circuit is full duplex, then there is no setup time for the reply, so
the minimum latency for sending a $p$-bit packet and getting a $p$-bit reply is
$T_s + 2p/b$ sec.
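As a rough illustration of the circuit-switched case, the two expressions $T_s + p/b$ and $T_s + 2p/b$ can be evaluated directly. The sketch below is only a worked example: the function names and the sample values (1 microsecond setup, 512-bit packet, 1 Gbit/sec link) are assumptions chosen for illustration, not figures from the text.

```python
def circuit_one_way_latency(setup_s, packet_bits, bandwidth_bps):
    """One-way latency T_s + p/b for a circuit-switched transfer."""
    return setup_s + packet_bits / bandwidth_bps


def circuit_round_trip_latency(setup_s, packet_bits, bandwidth_bps):
    """Minimum round-trip latency T_s + 2p/b on a full-duplex circuit.

    The reply reuses the already established circuit, so the setup
    time is paid only once.
    """
    return setup_s + 2 * packet_bits / bandwidth_bps


# Hypothetical sample values (assumptions, not from the text).
T_S = 1e-6   # total setup time in seconds
P = 512      # packet size in bits
B = 1e9      # bandwidth in bits/sec

one_way = circuit_one_way_latency(T_S, P, B)
round_trip = circuit_round_trip_latency(T_S, P, B)
print(f"one-way:    {one_way * 1e6:.3f} us")     # 1.512 us
print(f"round trip: {round_trip * 1e6:.3f} us")  # 2.024 us
```

With these numbers the setup time accounts for about two-thirds of the one-way latency, which suggests why setup cost tends to dominate for minimal packets.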
For packet switching, it is not necessary to send a probe packet to the destination
in advance, but there is still some internal setup time to assemble the packet,
$T_a$. Here the one-way transmission time is $T_a + p/b$, but this is only the time to
get the packet into the first switch. There is a finite delay within the switch, say $T_d$,
and then the process is repeated to the next switch and so on. The $T_d$ delay is composed
of both processing time and queueing delay, waiting for the output port to
become free. If there are $n$ switches, then the total one-way latency is given by the
formula $T_a + n(p/b + T_d) + p/b$, where the final term is due to the copy from the
last switch to the destination.
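The store-and-forward total, $T_a + n(p/b + T_d) + p/b$, is easy to tabulate for a few path lengths. The following is a minimal sketch; the function name and all numeric values (0.5 microsecond assembly time, 0.2 microsecond per-switch delay, 512-bit packet, 1 Gbit/sec links) are hypothetical and chosen only to make the arithmetic visible.

```python
def store_and_forward_latency(assemble_s, packet_bits, bandwidth_bps,
                              n_switches, switch_delay_s):
    """One-way latency T_a + n(p/b + T_d) + p/b through n switches."""
    transmit_s = packet_bits / bandwidth_bps  # p/b, time to push the packet over one link
    per_switch_s = transmit_s + switch_delay_s
    # The trailing transmit_s is the copy from the last switch to the destination.
    return assemble_s + n_switches * per_switch_s + transmit_s


# Hypothetical sample values (assumptions, not from the text).
T_A = 0.5e-6   # packet assembly time in seconds
T_D = 0.2e-6   # processing plus queueing delay per switch in seconds
P = 512        # packet size in bits
B = 1e9        # link bandwidth in bits/sec

for n in (1, 2, 4, 8):
    latency = store_and_forward_latency(T_A, P, B, n, T_D)
    print(f"n = {n}: {latency * 1e6:.3f} us")
```

For n = 4 this gives 0.5 + 4(0.512 + 0.2) + 0.512 = 3.860 microseconds, and the latency grows linearly with the number of switches on the path.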
The one-way latencies for virtual cut-through and wormhole routing in the best
case are close to $T_a + p/b$ because there is no probe packet to set up a circuit, and
no store-and-forward delay either. Basically, it is the initial setup time to assemble
the packet, plus the time to push the bits out the door. In all cases, propagation
delay has to be added, but that is usually small.
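To see why propagation delay is usually a small correction, the best-case cut-through latency $T_a + p/b$ can be computed with and without it. This is a sketch under stated assumptions: the function name, the cable length, and the signal speed of 2e8 m/sec (roughly two-thirds the speed of light, a common rule of thumb for cables) are not taken from the text.

```python
SIGNAL_SPEED_M_PER_S = 2e8  # assumed signal speed in the cable


def cut_through_best_case(assemble_s, packet_bits, bandwidth_bps,
                          distance_m=0.0):
    """Best-case one-way latency T_a + p/b, plus optional propagation delay."""
    propagation_s = distance_m / SIGNAL_SPEED_M_PER_S
    return assemble_s + packet_bits / bandwidth_bps + propagation_s


# Hypothetical sample values (assumptions, not from the text).
T_A = 0.5e-6   # packet assembly time in seconds
P = 512        # packet size in bits
B = 1e9        # link bandwidth in bits/sec

ideal = cut_through_best_case(T_A, P, B)
with_cable = cut_through_best_case(T_A, P, B, distance_m=10.0)
print(f"no propagation delay: {ideal * 1e6:.3f} us")       # 1.012 us
print(f"10 m of cable:        {with_cable * 1e6:.3f} us")  # 1.062 us
```

Under these assumptions, 10 m of cable adds 50 nanoseconds, about 5 percent of the total.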
The other hardware metric is bandwidth. Many parallel programs, especially
in the natural sciences, move a lot of data around, so the number of bytes/sec that
the system can move is critical to performance. Several metrics for bandwidth
exist. We have seen one of them—bisection bandwidth—already. Another is
aggregate bandwidth, which is computed by simply adding up the capacities of
all the links. This number gives the maximum number of bits that can be in transit
at once. Yet another important metric is the average bandwidth out of each CPU.
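Aggregate bandwidth, as described above, is just a sum over links. The sketch below computes it for a hypothetical four-CPU ring in which every directed link carries 1 Gbit/sec. It also computes an average outgoing bandwidth per CPU, taken here to mean the mean of the summed capacities of each CPU's outgoing links. That reading is an assumption, since the text does not define the metric precisely.

```python
from collections import defaultdict

# Hypothetical topology (an assumption, not from the text): 4 CPUs in a
# ring, each neighbor pair joined by one directed link in each direction.
LINK_BPS = 1e9
links = []  # entries are (source CPU, destination CPU, capacity in bits/sec)
for cpu in range(4):
    neighbor = (cpu + 1) % 4
    links.append((cpu, neighbor, LINK_BPS))
    links.append((neighbor, cpu, LINK_BPS))


def aggregate_bandwidth(links):
    """Sum of the capacities of all links, in bits/sec."""
    return sum(capacity for _, _, capacity in links)


def average_outgoing_bandwidth(links):
    """Mean outgoing link capacity per CPU (an assumed interpretation)."""
    per_cpu = defaultdict(float)
    for source, _, capacity in links:
        per_cpu[source] += capacity
    return sum(per_cpu.values()) / len(per_cpu)


print(f"aggregate: {aggregate_bandwidth(links) / 1e9:.1f} Gbit/s")             # 8.0
print(f"average per CPU: {average_outgoing_bandwidth(links) / 1e9:.1f} Gbit/s")  # 2.0
```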