Java Reference
In-Depth Information
for i =
1 to N do
for j =
1 to N do
A [ i , j ]
0
for k =
1 to N do
7
A [ i , j ]
A [ i , j ]
+ B [ i , k ]
× C [ k , j ]
8
Figure 14.3: Fusing the loop nests.
the benefits of WPO. Even if a method cannot be inlined, WPO can generate a
version of the invokedmethod that is customized to its calling context. In other
words, the trend toward reusable code can result in systems that are general
but not as e
cient as they could be. Optimizing compilers try to regain some
of the lost performance.
Target-Specific Optimization
Portability ranks high among the goals of today's programming languages.
Ideally, once a program is written in a high-level language, it should be
movable without modification to any computing system that supports the
language. Architected interpretive languages such as the Java Virtual Ma-
chine (JVM) support portability nicely. Any computer that hosts a JVM in-
terpreter can run any Java program. But the question remains—how fast?
Although most modern computers are based on RISC principles, the details of
their instruction sets vary widely. Moreover, various models for a given archi-
tecture can also di
er with regard to their storage hierarchies, their instruction
timings, and their degree of concurrency.
ff
Continuing with our example, the program shown in Figure 14.2 is an
improvement over the version shown in Figure 14.1, but it is possible to obtain
still better performance. Consider the behavior of thematrix Result .Thevalues
of Result are computed in the loop nest at Markers 3 and 4 —one element
at a time. Each of these elements is then copied from Result to A by the loop
nest at Marker 6 . Poor performance can be expected on any non-uniform
memory access (NUMA) system that cannot accommodate Result at its fastest
storage level. Better performance can be obtained if the data is accumulated
in a register and then stored directly in A . Optimizing compilers that feature
loop fusion can identify that the outer two loop nests at Markers 3 and 6
are structurally equivalent. Dependence analysis can show that each element
of Result is computed independently. The loops can then be fused to obtain
the program shown in Figure 14.3 in which the Result matrix is eliminated.
 
Search WWH ::




Custom Search