10.2.7 Overhead Related to Parallelization
Neither Amdahl's law nor Gustafson-Barsis's law has considered the impact of
overhead on the achievable speedup. Moreover, both theories place a disproportionate
emphasis on the fraction of non-parallelizable computation. In many
large-scale problems, however, the actual fraction of non-parallelizable computation
is virtually zero. The real obstacles to achieving high speedup values are different
types of overhead that are associated with the parallelization.
First of all, a parallelized algorithm can introduce additional computations that
were not present in the original serial algorithm. For example, work division often
requires a few arithmetic operations, such as using (10.8) and (10.9). The cost of
partitioning unstructured computational meshes, in particular, is often considerable.
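As an illustration of the kind of extra arithmetic that work division introduces, the following C fragment computes a simple block partition of $n$ work items among $P$ processors. This is only a sketch; the exact expressions in (10.8) and (10.9) are not reproduced here and may differ.

```c
#include <stdio.h>

/* A generic block partition of n work items among P processors:
 * processor p is assigned the indices [lower, upper).  This is only
 * meant to illustrate the small amount of extra arithmetic that work
 * division adds; the exact formulas (10.8)-(10.9) may differ. */
void block_partition(int n, int P, int p, int *lower, int *upper)
{
    *lower = (int)((long long)n * p / P);
    *upper = (int)((long long)n * (p + 1) / P);
}

int main(void)
{
    int n = 1000003, P = 8;
    for (int p = 0; p < P; p++) {
        int lo, hi;
        block_partition(n, P, p, &lo, &hi);
        printf("processor %d: indices %d..%d (%d items)\n",
               p, lo, hi - 1, hi - lo);
    }
    return 0;
}
```

Note that the resulting pieces differ in size by at most one item, so the load imbalance of such a partition is negligible for large $n$.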
Second, when perfect work division is either impossible or too expensive to achieve,
the resulting work load imbalance will decrease the actual speedup. Third, explicit
synchronization between processors may be needed from time to time. Such opera-
tions can be costly. Fourth, communication overhead can surely not be overlooked
in most parallel applications. For a collective communication involving all $P$ processors, the cost is typically of the order $O(\lceil \log_2 P \rceil)$, where $\lceil\,\cdot\,\rceil$ denotes the ceiling function, which gives the smallest integer value that is equal to or larger than $\log_2 P$.
For one-to-one communication on distributed-memory systems, an idealized model
for the overhead of transferring a message with L bytes from one processor to
another is as follows:
$$t_C(L) = \tau + \beta L, \qquad\qquad (10.19)$$
where $\tau$ is the so-called latency, which represents the start-up time for communication, and $\beta$ is the cost of transferring one byte of data. By the way, $1/\beta$ is often
referred to as the bandwidth of the communication network. The parameter values
of $\tau$ and $\beta$ can vary greatly from system to system. The reason for saying that (10.19)
is an idealized model is because it does not consider the possible situation of sev-
eral processors competing for the network, which can reduce the network's effective
capacity. Also, the actual curve of $t_C(L)$ is often of a staircase shape, instead of a
straight line with constant slope. (For example, transferring one double-precision
value can take the same time as transferring a small number of double-precision
values in one message.) Although shared-memory systems seem to be free of one-
to-one communications, there is related overhead behind the scenes. For example,
each processor typically has its own private cache. We remark that a cache is a small
piece of fast memory that dynamically duplicates a small portion of the main mem-
ory. The speed of cache is much faster than that of the main memory, and the use of
cache is meant to overcome the memory-speed bottleneck. A variable in the shared
memory can thus risk being updated differently in several caches. Keeping the different
caches coherent with one another can therefore incur costly operations in
the parallel system.
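To get a feeling for the sizes involved, the small C program below evaluates the idealized model (10.19) for a few message lengths and also computes the $\lceil \log_2 P \rceil$ factor of a collective operation. The chosen values of $\tau$ and $\beta$ are purely illustrative; realistic values must be measured on the actual system.

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Hypothetical parameters, for illustration only; real values are
     * system dependent.  tau: start-up latency, beta: cost per byte. */
    double tau  = 5.0e-6;   /* 5 microseconds latency                  */
    double beta = 1.0e-9;   /* 1 ns per byte, i.e. 1 GB/s bandwidth    */

    long sizes[] = {8, 1024, 1048576};   /* message lengths in bytes   */
    for (int i = 0; i < 3; i++) {
        double t = tau + beta * (double)sizes[i];   /* model (10.19)   */
        printf("L = %8ld bytes: t_C = %.3e s\n", sizes[i], t);
    }

    /* Number of steps in a tree-based collective among P processors. */
    int P = 100;
    int steps = (int)ceil(log2((double)P));
    printf("collective over P = %d processors: about %d steps\n", P, steps);
    return 0;
}
```

With these illustrative parameters, an 8-byte message is completely dominated by the latency term, whereas for a 1 MB message the bandwidth term takes over; this is why sending many small messages is usually much more expensive than sending the same data in one large message.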
It should be mentioned that there are also factors that may be speedup-friendly.
First, many parallel systems have the capability of carrying out communications
at the same time as computations. This allows for the possibility of hiding the
communication overhead.
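The following is a minimal sketch of this idea in an MPI setting, assuming nonblocking point-to-point communication; the function name, the neighbor rank, and the division of the work into a part that is independent of the incoming message and a part that needs it are all hypothetical.

```c
#include <mpi.h>

/* Sketch: start a nonblocking exchange with a neighbor, do local
 * computation that does not depend on the incoming data, then wait.
 * The neighbor rank, the buffer length n and the split of the work
 * into independent and dependent parts are invented for illustration. */
void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];

    MPI_Isend(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);

    /* ... computation on purely local data goes here; it overlaps
       with the message transfer and thus hides part of t_C(L) ...   */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* ... computation that needs recvbuf can only start here ...    */
}
```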