PARALLEL COMPUTER ARCHITECTURES - Structured Computer Organization

Hardware Metrics

From a hardware perspective, the performance metrics of interest are the CPU and I/O speeds and the performance of the interconnection network. The CPU and I/O speeds are the same as in the uniprocessor case, so the key parameters of interest in a parallel system are those associated with the interconnect. There are two key items: latency and bandwidth, which we will now look at in turn.

The round-trip latency is the time it takes for a CPU to send a packet and get a reply. If the packet is sent to a memory, then the latency measures the time to read or write a word or block of words. If it is sent to another CPU, it measures the interprocessor communication time for packets of that size. Usually, the latency of interest is for minimal packets, often one word or a small cache line.

The latency is built up from several factors and is different for circuit-switched, store-and-forward, virtual cut-through, and wormhole-routed interconnects. For circuit switching, the latency is the sum of the setup time and the transmission time. To set up a circuit, a probe packet has to be sent out to reserve the resources and then report back. Once that has happened, the data packet has to be assembled. When it is ready, bits can flow at full speed, so if the total setup time is $T_s$, the packet size is $p$ bits, and the bandwidth is $b$ bits/sec, the one-way latency is $T_s + p/b$. If the circuit is full duplex, then there is no setup time for the reply, so the minimum latency for sending a $p$-bit packet and getting a $p$-bit reply is $T_s + 2p/b$ sec.
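The two circuit-switching formulas can be checked with a few lines of Python. This is only a sketch: the function names and the numeric values (2 microseconds of setup, a 512-bit packet, a 1 Gbit/sec link) are assumptions chosen for illustration, not figures from the text.

```python
def circuit_one_way_latency(t_setup, packet_bits, bandwidth_bps):
    """One-way latency T_s + p/b for a circuit-switched transfer, in seconds."""
    return t_setup + packet_bits / bandwidth_bps


def circuit_round_trip_latency(t_setup, packet_bits, bandwidth_bps):
    """Minimum round-trip latency T_s + 2p/b over a full-duplex circuit.

    The reply reuses the already established circuit, so the setup
    time is paid only once.
    """
    return t_setup + 2 * packet_bits / bandwidth_bps


# Assumed, illustrative values (not from the text).
T_S = 2e-6   # setup time in seconds
P = 512      # packet size in bits
B = 1e9      # bandwidth in bits/sec

one_way = circuit_one_way_latency(T_S, P, B)
round_trip = circuit_round_trip_latency(T_S, P, B)
print(f"one-way:    {one_way * 1e6:.3f} us")
print(f"round trip: {round_trip * 1e6:.3f} us")
```

With these assumed numbers the setup time is larger than the transmission time, which is typical for small packets.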

For packet switching, it is not necessary to send a probe packet to the destination in advance, but there is still some internal setup time to assemble the packet, $T_a$. Here the one-way transmission time is $T_a + p/b$, but this is only the time to get the packet into the first switch. There is a finite delay within the switch, say $T_d$, and then the process is repeated to the next switch and so on. The $T_d$ delay is composed of both processing time and queueing delay, waiting for the output port to become free. If there are $n$ switches, then the total one-way latency is given by the formula $T_a + n(p/b + T_d) + p/b$, where the final term is due to the copy from the last switch to the destination.
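The store-and-forward formula can be sketched the same way to show how latency grows with the number of switches. The function name and the sample values are assumptions for illustration, and the sketch uses a single $T_d$ for every switch, whereas real queueing delay varies from switch to switch.

```python
def store_and_forward_latency(t_assemble, packet_bits, bandwidth_bps,
                              n_switches, t_switch_delay):
    """One-way latency T_a + n(p/b + T_d) + p/b, in seconds."""
    hop_time = packet_bits / bandwidth_bps  # time to copy the packet over one link
    return (t_assemble
            + n_switches * (hop_time + t_switch_delay)
            + hop_time)  # final copy from the last switch to the destination


# Assumed values: 1 us assembly, 512-bit packet, 1 Gbit/sec links,
# 0.5 us delay per switch.
for n in (0, 1, 4, 16):
    latency = store_and_forward_latency(1e-6, 512, 1e9, n, 0.5e-6)
    print(f"n = {n:2d} switches: {latency * 1e6:.3f} us")
```

Each additional switch adds a full packet transmission time plus the switch delay, so the latency is linear in $n$.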

The one-way latencies for virtual cut-through and wormhole routing in the best case are close to $T_a + p/b$ because there is no probe packet to set up a circuit, and no store-and-forward delay either. Basically, it is the initial setup time to assemble the packet, plus the time to push the bits out the door. In all cases, propagation delay has to be added, but that is usually small.
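To give a rough sense of why propagation delay is usually small, the sketch below adds it to the best-case cut-through latency. The function name, the 10 m cable length, and the signal speed of 2e8 m/sec (about two-thirds of the speed of light) are assumptions for illustration only.

```python
def cut_through_latency(t_assemble, packet_bits, bandwidth_bps,
                        distance_m=0.0, signal_speed_mps=2e8):
    """Best-case one-way latency T_a + p/b plus propagation delay, in seconds.

    signal_speed_mps is an assumed figure for signals in a cable.
    """
    transmission = packet_bits / bandwidth_bps
    propagation = distance_m / signal_speed_mps
    return t_assemble + transmission + propagation


# Assumed values: 1 us assembly, 512-bit packet, 1 Gbit/sec link.
base = cut_through_latency(1e-6, 512, 1e9)
with_cable = cut_through_latency(1e-6, 512, 1e9, distance_m=10.0)
share = (with_cable - base) / with_cable

print(f"without propagation: {base * 1e6:.3f} us")
print(f"with 10 m of cable:  {with_cable * 1e6:.3f} us")
print(f"propagation share:   {share:.1%}")
```

Under these assumptions, propagation contributes only a few percent of the total; over much longer distances it would no longer be negligible.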

The other hardware metric is bandwidth. Many parallel programs, especially in the natural sciences, move a lot of data around, so the number of bytes/sec that the system can move is critical to performance. Several metrics for bandwidth exist. We have already seen one of them, bisection bandwidth. Another is aggregate bandwidth, which is computed by simply adding up the capacities of all the links. This number gives the maximum number of bits that can be in transit at once. Yet another important metric is the average bandwidth out of each CPU.
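Aggregate bandwidth as defined above is just a sum over links. A minimal sketch, assuming a hypothetical four-CPU ring in which every link carries 1 Gbit/sec (the topology, names, and capacities are invented for illustration):

```python
def aggregate_bandwidth(link_capacities_bps):
    """Sum of the capacities of all links, in bits/sec."""
    return sum(link_capacities_bps)


# Hypothetical 4-CPU ring: each entry maps a link (pair of endpoints)
# to its assumed capacity in bits/sec.
ring_links = {
    ("cpu0", "cpu1"): 1e9,
    ("cpu1", "cpu2"): 1e9,
    ("cpu2", "cpu3"): 1e9,
    ("cpu3", "cpu0"): 1e9,
}

total = aggregate_bandwidth(ring_links.values())
print(f"aggregate bandwidth: {total / 1e9:.1f} Gbit/sec")
```

Keeping the links in a dictionary keyed by endpoint pairs makes it easy to extend the sketch to other topologies by listing their links.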
