for statically allocated pages. Suppose the program is parallelized so that multiple processes
allocate the pages. Because page allocation requires the use of the page table data structure,
which is locked whenever it is in use, even an OS kernel that allows multiple threads in the OS
will be serialized if the processes all try to allocate their pages at once (which is exactly what
we might expect at initialization time!).
This page table serialization eliminates parallelism in initialization and has a significant impact on overall parallel performance. This performance bottleneck persists even under multiprogramming. For example, suppose we split the parallel program apart into separate processes and run them, one process per processor, so that there is no sharing between the processes. (This is exactly what one user did, since he reasonably believed that the performance problem was due to unintended sharing or interference in his application.) Unfortunately, the lock still serializes all the processes, so even the multiprogramming performance is poor. This pitfall indicates the kind of subtle but significant performance bugs that can arise when software runs on multiprocessors. Like many other key software components, the OS algorithms and data structures must be rethought in a multiprocessor context. Placing locks on smaller portions of the page table effectively eliminates the problem. Similar problems exist in other shared memory structures, increasing coherence traffic in cases where no sharing is actually occurring.
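The fix described above, placing locks on smaller portions of the page table rather than one lock on the whole structure, is a form of lock striping. The following is a minimal illustrative sketch, not any real kernel's implementation: the class name, stripe count, and modulo mapping are all assumptions chosen for clarity.

```python
import threading

# Hypothetical sketch of fine-grained page table locking ("lock striping").
# Instead of one lock guarding the entire table, fixed groups of entries
# share one of NUM_STRIPES locks, so allocations that touch different
# stripes proceed in parallel. All names and sizes here are illustrative.

NUM_STRIPES = 16

class StripedPageTable:
    def __init__(self, num_pages):
        self.entries = [None] * num_pages
        self.locks = [threading.Lock() for _ in range(NUM_STRIPES)]

    def _lock_for(self, page_number):
        # Map each page to a stripe by simple modulo hashing.
        return self.locks[page_number % NUM_STRIPES]

    def allocate(self, page_number, frame):
        # Only threads touching the same stripe contend here; with a
        # single global lock, every allocation would serialize.
        with self._lock_for(page_number):
            if self.entries[page_number] is None:
                self.entries[page_number] = frame
                return True
            return False
```

With this scheme, the initialization pattern in the pitfall (every process allocating its pages at once) contends only when two allocations happen to hash to the same stripe, instead of serializing on one global lock.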
As multicore became the dominant theme in everything from desktops to servers, the lack of an adequate investment in parallel software became apparent. Given this lack of focus, it will likely be many years before the software systems we use adequately exploit this growing number of cores.
5.10 Concluding Remarks
For more than 30 years, researchers and designers have predicted the end of uniprocessors
and their dominance by multiprocessors. Until the early years of this century, this prediction
was constantly proven wrong. As we saw in Chapter 3, the costs of trying to find and exploit
more ILP are prohibitive in efficiency (both in silicon area and in power). Of course, multicore
does not solve the power problem, since it clearly increases both the transistor count and the
active number of transistors switching, which are the two dominant contributions to power.
However, multicore does alter the game. By allowing idle cores to be placed in power-
saving mode, some improvement in power efficiency can be achieved, as the results in this
chapter have shown. More importantly, multicore shifts the burden for keeping the processor
busy by relying more on TLP, which the application and programmer are responsible for
identifying, rather than on ILP, for which the hardware is responsible. As we saw, these differences clearly played out in the multicore performance and energy efficiency of the Java versus the PARSEC benchmarks.
Although multicore provides some direct help with the energy efficiency challenge and
shifts much of the burden to the software system, there remain difficult challenges and unresolved questions. For example, attempts to exploit thread-level versions of aggressive speculation have so far met the same fate as their ILP counterparts. That is, the performance gains have been modest and are likely less than the increase in energy consumption, so ideas such as speculative threads or hardware run-ahead have not been successfully incorporated in processors. As in speculation for ILP, unless the speculation is almost always right, the costs exceed the benefits.
In addition to the central problems of programming languages and compiler technology,
multicore has reopened another long-standing question in computer architecture: Is it worth-
 