Thread-Level Parallelism - Computer Architecture: A Quantitative Approach

Thus, to achieve a speedup of 80 with 100 processors, only 0.25% of the original computation can be sequential. Of course, to achieve linear speedup (speedup of n with n processors), the entire program must usually be parallel with no serial portions. In practice, programs do not just operate in fully parallel or sequential mode, but often use less than the full complement of the processors when running in parallel mode.
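The 0.25% figure can be checked with a short sketch of Amdahl's law, assuming the simple model in which a fraction f of the original computation is sped up by a factor of n and the rest is not sped up at all. The function names are illustrative and do not come from the text.

```python
def amdahl_speedup(parallel_fraction, n):
    """Overall speedup when `parallel_fraction` of the work runs on n processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def parallel_fraction_for(target_speedup, n):
    """Solve 1 / ((1 - f) + f / n) = target_speedup for the parallel fraction f."""
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / n)


# Assumed inputs, taken from the sentence above: speedup of 80 on 100 processors.
f = parallel_fraction_for(80, 100)
sequential_pct = (1.0 - f) * 100.0

print(f"parallel fraction:   {f:.4f}")
print(f"sequential fraction: {sequential_pct:.2f}%")
print(f"check speedup:       {amdahl_speedup(f, 100):.1f}")
```

Under this model the sequential share comes out at roughly a quarter of one percent, which matches the figure quoted above. The model does not capture the case mentioned in the text where only some of the processors are busy during parallel phases.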

The second major challenge in parallel processing involves the large latency of remote access in a parallel processor. In existing shared-memory multiprocessors, communication of data between separate cores may cost 35 to 50 clock cycles and among cores on separate chips anywhere from 100 clock cycles to as much as 500 or more clock cycles (for large-scale multiprocessors), depending on the communication mechanism, the type of interconnection network, and the scale of the multiprocessor. The effect of long communication delays is clearly substantial. Let's consider a simple example.

Example

Suppose we have an application running on a 32-processor multiprocessor, which takes 200 ns to handle a reference to a remote memory. For this application, assume that all the references except those involving communication hit in the local memory hierarchy, which is slightly optimistic. Processors are stalled on a remote request, and the processor clock rate is 3.3 GHz. If the base CPI (assuming that all references hit in the cache) is 0.5, how much faster is the multiprocessor if there is no communication versus if 0.2% of the instructions involve a remote communication reference?

Answer

It is simpler to first calculate the clock cycles per instruction. The effective CPI for the multiprocessor with 0.2% remote references is
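The calculation can be sketched as follows, assuming the usual stall-cycle decomposition: effective CPI = base CPI + remote request rate × remote request cost, with the remote request cost expressed in clock cycles (latency in ns times clock rate in GHz). The variable and function names are illustrative, and the arithmetic uses the exact clock period rather than a rounded one.

```python
def remote_cost_cycles(latency_ns, clock_ghz):
    """Remote access latency in clock cycles (ns * cycles per ns)."""
    return latency_ns * clock_ghz


def effective_cpi(base_cpi, remote_fraction, latency_ns, clock_ghz):
    """Base CPI plus stall cycles per instruction caused by remote references."""
    return base_cpi + remote_fraction * remote_cost_cycles(latency_ns, clock_ghz)


# Values from the example: 200 ns remote latency, 3.3 GHz clock,
# base CPI of 0.5, and 0.2% of instructions making a remote reference.
base_cpi = 0.5
cost = remote_cost_cycles(200.0, 3.3)
cpi = effective_cpi(base_cpi, 0.002, 200.0, 3.3)

# Instruction count and clock rate are the same in both cases,
# so the performance ratio is just the ratio of the CPIs.
slowdown = cpi / base_cpi

print(f"remote request cost: {cost:.0f} cycles")
print(f"effective CPI:       {cpi:.2f}")
print(f"all-local speedup:   {slowdown:.2f}x")
```

With these inputs the remote request costs about 660 cycles, the effective CPI is about 1.82, and the multiprocessor with all local references is roughly 3.6 times faster. Rounding the clock period before dividing gives slightly different figures.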
