for statically allocated pages. Suppose the program is parallelized so that multiple processes
allocate the pages. Because page allocation requires the use of the page table data structure,
which is locked whenever it is in use, even an OS kernel that allows multiple threads in the OS
will be serialized if the processes all try to allocate their pages at once (which is exactly what
we might expect at initialization time!).
This page table serialization eliminates parallelism in initialization and has a significant impact on overall parallel performance. This performance bottleneck persists even under multiprogramming. For example, suppose we split the parallel program apart into separate processes and run them, one process per processor, so that there is no sharing between the processes. (This is exactly what one user did, since he reasonably believed that the performance problem was due to unintended sharing or interference in his application.) Unfortunately, the lock still serializes all the processes, so even the multiprogramming performance is poor. This pitfall indicates the kind of subtle but significant performance bugs that can arise when software runs on multiprocessors. Like many other key software components, the OS algorithms and data structures must be rethought in a multiprocessor context. Placing locks on smaller portions of the page table effectively eliminates the problem. Similar problems exist in other shared memory structures, increasing coherence traffic in cases where no sharing is actually occurring.
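The fix described above, placing locks on smaller portions of the page table rather than one lock on the whole structure, is a form of lock striping. The following is a minimal illustrative sketch, not any real kernel's implementation: the class name, stripe count, and modulo mapping are all assumptions chosen for clarity.

```python
import threading

# Hypothetical sketch of fine-grained page table locking ("lock striping").
# Instead of one lock guarding the entire table, fixed groups of entries
# share one of NUM_STRIPES locks, so allocations that touch different
# stripes proceed in parallel. All names and sizes here are illustrative.

NUM_STRIPES = 16

class StripedPageTable:
    def __init__(self, num_pages):
        self.entries = [None] * num_pages
        self.locks = [threading.Lock() for _ in range(NUM_STRIPES)]

    def _lock_for(self, page_number):
        # Map each page to a stripe by simple modulo hashing.
        return self.locks[page_number % NUM_STRIPES]

    def allocate(self, page_number, frame):
        # Only threads touching the same stripe contend here; with a
        # single global lock, every allocation would serialize.
        with self._lock_for(page_number):
            if self.entries[page_number] is None:
                self.entries[page_number] = frame
                return True
            return False
```

With this scheme, the initialization pattern in the pitfall (every process allocating its pages at once) contends only when two allocations happen to hash to the same stripe, instead of serializing on one global lock.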
As multicore became the dominant theme in everything from desktops to servers, the lack of an adequate investment in parallel software became apparent. Given this lack of focus, it will likely be many years before the software systems we use adequately exploit this growing number of cores.
5.10 Concluding Remarks
For more than 30 years, researchers and designers have predicted the end of uniprocessors
and their dominance by multiprocessors. Until the early years of this century, this prediction
was constantly proven wrong. As we saw in Chapter 3, the costs of trying to find and exploit
more ILP are prohibitive in efficiency (both in silicon area and in power). Of course, multicore
does not solve the power problem, since it clearly increases both the transistor count and the
active number of transistors switching, which are the two dominant contributions to power.
However, multicore does alter the game. By allowing idle cores to be placed in power-
saving mode, some improvement in power efficiency can be achieved, as the results in this
chapter have shown. More importantly, multicore shifts the burden for keeping the processor
busy by relying more on TLP, which the application and programmer are responsible for
identifying, rather than on ILP, for which the hardware is responsible. As we saw, these differences clearly played out in the multicore performance and energy efficiency of the Java versus the PARSEC benchmarks.
Although multicore provides some direct help with the energy efficiency challenge and
shifts much of the burden to the software system, there remain difficult challenges and unresolved questions. For example, attempts to exploit thread-level versions of aggressive speculation have so far met the same fate as their ILP counterparts. That is, the performance gains have been modest and are likely less than the increase in energy consumption, so ideas such as speculative threads or hardware run-ahead have not been successfully incorporated in processors. As in speculation for ILP, unless the speculation is almost always right, the costs exceed the benefits.
In addition to the central problems of programming languages and compiler technology,
multicore has reopened another long-standing question in computer architecture: Is it worth-
 