10.2.7 Overhead Related to Parallelization
Neither Amdahl's law nor Gustafson-Barsis's law has considered the impact of
overhead on the achievable speedup. Moreover, both theories place a disproportionate
emphasis on the fraction of non-parallelizable computation. In many
large-scale problems, however, the actual fraction of non-parallelizable computation
is virtually zero. The real obstacles to achieving high speedup values are different
types of overhead that are associated with the parallelization.
First of all, a parallelized algorithm can introduce additional computations that
were not present in the original serial algorithm. For example, work division often
requires a few arithmetic operations, such as using (10.8) and (10.9). The cost of
partitioning unstructured computational meshes, in particular, is often considerable.
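As an illustration of the kind of extra arithmetic that work division introduces, the following C fragment computes a simple block partition of $n$ work items among $P$ processors. This is only a sketch; the exact expressions in (10.8) and (10.9) are not reproduced here and may differ.

```c
#include <stdio.h>

/* A generic block partition of n work items among P processors:
 * processor p is assigned the indices [lower, upper).  This is only
 * meant to illustrate the small amount of extra arithmetic that work
 * division adds; the exact formulas (10.8)-(10.9) may differ. */
void block_partition(int n, int P, int p, int *lower, int *upper)
{
    *lower = (int)((long long)n * p / P);
    *upper = (int)((long long)n * (p + 1) / P);
}

int main(void)
{
    int n = 1000003, P = 8;
    for (int p = 0; p < P; p++) {
        int lo, hi;
        block_partition(n, P, p, &lo, &hi);
        printf("processor %d: indices %d..%d (%d items)\n",
               p, lo, hi - 1, hi - lo);
    }
    return 0;
}
```

Note that the resulting pieces differ in size by at most one item, so the load imbalance of such a partition is negligible for large $n$.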
Second, when perfect work division is either impossible or too expensive to achieve,
the resulting work load imbalance will decrease the actual speedup. Third, explicit
synchronization between processors may be needed from time to time. Such opera-
tions can be costly. Fourth, communication overhead can surely not be overlooked
in most parallel applications. For a collective communication involving all $P$ processors, the cost is typically of the order $O(\lceil \log_2 P \rceil)$, where $\lceil\,\cdot\,\rceil$ denotes the ceiling function, which gives the smallest integer value that is equal to or larger than $\log_2 P$.
For one-to-one communication on distributed-memory systems, an idealized model
for the overhead of transferring a message with L bytes from one processor to
another is as follows:
$$t_C(L) = \tau + \beta L, \qquad\qquad (10.19)$$
where $\tau$ is the so-called latency, which represents the start-up time for communication, and $\beta$ is the cost of transferring one byte of data. By the way, $1/\beta$ is often
referred to as the bandwidth of the communication network. The parameter values
of $\tau$ and $\beta$ can vary greatly from system to system. The reason for saying that (10.19)
is an idealized model is because it does not consider the possible situation of sev-
eral processors competing for the network, which can reduce the network's effective
capacity. Also, the actual curve of $t_C(L)$ is often of a staircase shape, instead of a
straight line with constant slope. (For example, transferring one double-precision
value can take the same time as transferring a small number of double-precision
values in one message.) Although shared-memory systems seem to be free of one-
to-one communications, there is related overhead behind the scenes. For example,
each processor typically has its own private cache. We remark that a cache is a small
piece of fast memory that dynamically duplicates a small portion of the main mem-
ory. The speed of cache is much faster than that of the main memory, and the use of
cache is meant to overcome the memory-speed bottleneck. A variable in the shared
memory can thus risk being updated differently in several caches. Keeping the different
caches coherent with one another can therefore incur costly operations in
the parallel system.
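To get a feeling for the sizes involved, the small C program below evaluates the idealized model (10.19) for a few message lengths and also computes the $\lceil \log_2 P \rceil$ factor of a collective operation. The chosen values of $\tau$ and $\beta$ are purely illustrative; realistic values must be measured on the actual system.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Hypothetical parameters, for illustration only; real values are
     * system dependent.  tau: start-up latency, beta: cost per byte. */
    double tau  = 5.0e-6;   /* 5 microseconds latency                  */
    double beta = 1.0e-9;   /* 1 ns per byte, i.e. 1 GB/s bandwidth    */

    long sizes[] = {8, 1024, 1048576};   /* message lengths in bytes   */
    for (int i = 0; i < 3; i++) {
        double t = tau + beta * (double)sizes[i];   /* model (10.19)   */
        printf("L = %8ld bytes: t_C = %.3e s\n", sizes[i], t);
    }

    /* Number of steps in a tree-based collective among P processors. */
    int P = 100;
    int steps = (int)ceil(log2((double)P));
    printf("collective over P = %d processors: about %d steps\n", P, steps);
    return 0;
}
```

With these illustrative parameters, an 8-byte message is completely dominated by the latency term, whereas for a 1 MB message the bandwidth term takes over; this is why sending many small messages is usually much more expensive than sending the same data in one large message.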
It should be mentioned that there are also factors that may be speedup-friendly.
First, many parallel systems have the capability of carrying out communications
at the same time as computations. This allows for the possibility of hiding the
communication overhead.
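The following is a minimal sketch of this idea in an MPI setting, assuming nonblocking point-to-point communication; the function name, the neighbor rank, and the division of the work into a part that is independent of the incoming message and a part that needs it are all hypothetical.

```c
#include <mpi.h>

/* Sketch: start a nonblocking exchange with a neighbor, do local
 * computation that does not depend on the incoming data, then wait.
 * The neighbor rank, the buffer length n and the split of the work
 * into independent and dependent parts are invented for illustration. */
void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];

    MPI_Isend(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    /* ... computation on purely local data goes here; it overlaps
       with the message transfer and thus hides part of t_C(L) ...   */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* ... computation that needs recvbuf can only start here ...    */
}
```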