Thus, to achieve a speedup of 80 with 100 processors, only 0.25% of the original
computation can be sequential. Of course, to achieve linear speedup (speedup
of n with n processors), the entire program must usually be parallel with no
serial portions. In practice, programs do not just operate in fully parallel or
sequential mode, but often use less than the full complement of the processors
when running in parallel mode.
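The 0.25% figure can be checked numerically. The sketch below assumes the standard form of Amdahl's law, speedup = 1 / ((1 - f) + f / n), where f is the parallel fraction and n is the number of processors. The function names are illustrative only and do not come from the text.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / processors)


def max_sequential_fraction(target_speedup, processors):
    """Largest sequential fraction that still reaches target_speedup.

    Obtained by solving target = 1 / ((1 - f) + f / n) for f
    and returning 1 - f.
    """
    parallel_fraction = (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)
    return 1.0 - parallel_fraction


seq = max_sequential_fraction(80, 100)
print(f"Sequential fraction allowed: {seq:.2%}")  # prints 0.25%
print(f"Speedup check: {amdahl_speedup(1.0 - seq, 100):.1f}")  # prints 80.0
```

Solving for the fraction directly, rather than searching for it, keeps the check exact up to floating-point rounding.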
The second major challenge in parallel processing involves the large latency of remote
access in a parallel processor. In existing shared-memory multiprocessors, communication of
data between separate cores may cost 35 to 50 clock cycles and among cores on separate chips
anywhere from 100 clock cycles to as much as 500 or more clock cycles (for large-scale
multiprocessors), depending on the communication mechanism, the type of interconnection
network, and the scale of the multiprocessor. The effect of long communication delays is clearly
substantial. Let's consider a simple example.
Example
Suppose we have an application running on a 32-processor multiprocessor,
which has a 200 ns time to handle a reference to a remote memory. For this
application, assume that all the references except those involving communication
hit in the local memory hierarchy, which is slightly optimistic. Processors are
stalled on a remote request, and the processor clock rate is 3.3 GHz. If the base
CPI (assuming that all references hit in the cache) is 0.5, how much faster is the
multiprocessor if there is no communication versus if 0.2% of the instructions
involve a remote communication reference?
Answer
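It is simpler to first calculate the clock cycles per instruction. The effective CPI
for the multiprocessor with 0.2% remote references is

$$\text{CPI} = \text{Base CPI} + \text{Remote request rate} \times \text{Remote request cost} = 0.5 + 0.2\% \times \text{Remote request cost}$$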
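As a rough numeric check of this example, the sketch below plugs the stated values into the usual relation: effective CPI equals base CPI plus the remote request rate times the remote request cost in cycles. It assumes the remote cost is the 200 ns latency multiplied by the exact 3.3 GHz clock rate, with no rounding of the cycle time, so results may differ slightly from a hand calculation that rounds first. The function and variable names are illustrative.

```python
def remote_cost_cycles(latency_s, clock_rate_hz):
    """Clock cycles a processor stalls for one remote reference."""
    return latency_s * clock_rate_hz


def effective_cpi(base_cpi, remote_fraction, cost_cycles):
    """Base CPI plus the average stall cycles per instruction."""
    return base_cpi + remote_fraction * cost_cycles


BASE_CPI = 0.5           # CPI when all references hit in the cache
REMOTE_FRACTION = 0.002  # 0.2% of instructions make a remote reference
LATENCY_S = 200e-9       # 200 ns remote memory access time
CLOCK_RATE_HZ = 3.3e9    # 3.3 GHz processor clock

cost = remote_cost_cycles(LATENCY_S, CLOCK_RATE_HZ)
cpi = effective_cpi(BASE_CPI, REMOTE_FRACTION, cost)
ratio = cpi / BASE_CPI   # how much faster the no-communication case runs

print(f"Remote request cost: {cost:.0f} cycles")  # prints 660 cycles
print(f"Effective CPI: {cpi:.2f}")                # prints 1.82
print(f"No-communication advantage: {ratio:.2f}x")  # prints 3.64x
```

Because the instruction count and clock rate are the same in both cases, the ratio of the two CPIs is also the ratio of execution times.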