input frame size is not large enough to make those additional costs negligible, we
convert this intra-frame data independency into instruction-level parallelism,
which can be exploited by VLIW or superscalar processors. The instruction-level
parallelism can be expressed explicitly in an executable file, since the parallelism
is known at compile time. Both VLIW and superscalar processors can exploit static
instruction-level parallelism. Superscalar processors use hardware schemes to
discover instruction parallelism in a program, so a superscalar processor remains
backward compatible with code compiled for older-generation processors. For this
reason, most general-purpose processors are superscalar processors. On the other
hand, with dedicated compiler support a VLIW processor can achieve similar
performance on a program with explicit parallelism while using significantly less
hardware. We use a VLIW processor to exploit the instruction-level parallelism
that results from the intra-frame data independency, since such parallelism can
be expressed explicitly at compile time.
In the following, we introduce our process of converting intra-frame data
independency into instruction-level parallelism. Although the target is a VLIW
processor, most parts of this process benefit superscalar processors as well.
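As a small illustration of this difference (a hedged sketch; the function and array names are hypothetical and do not come from our code), consider two operations on independent pieces of frame data:

/* Two updates that touch disjoint data and therefore have no dependency.
 * A VLIW compiler can see this independence at compile time and pack both
 * operations into one wide instruction word; a superscalar processor would
 * instead rediscover the same independence in hardware at run time. */
void update_pair(float *luma, float *chroma, int i)
{
    luma[i]   = luma[i]   * 0.5f;   /* independent operation 1 */
    chroma[i] = chroma[i] + 1.0f;   /* independent operation 2 */
}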
The first step is to use loop fusion, which combines two similar, adjacent loops
to reduce loop overhead, and loop unrolling, which merges consecutive iterations
of a loop so that, in the absence of loop-carried dependencies, several iterations
can be executed at the same time. Both transformations increase the basic block
size and thus increase the available instruction-level parallelism. Figure 16
shows examples of loop fusion and unrolling.
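As a concrete sketch of the fusion half (the array names and operations below are illustrative assumptions, not taken from Figure 16), two similar, adjacent loops over the same index range can be merged into one:

#include <stddef.h>

/* Before fusion: two similar, adjacent loops over the same range. */
void process_separate(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;          /* loop 1 */
    for (size_t i = 0; i < n; i++)
        dst[i] = dst[i] + 1.0f;          /* loop 2 */
}

/* After fusion: one loop whose body contains both statements, giving a
 * larger basic block per trip without changing the computed result. */
void process_fused(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * 2.0f;
        dst[i] = dst[i] + 1.0f;
    }
}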
When a loop is executed, there might be dependencies between trips. Instructions
that belong to different trips linked by such dependencies cannot be executed
simultaneously. The essential idea behind loop fusion and loop unrolling is to
decrease the total number of trips that need to be executed by putting more work
into each trip. Loop fusion merges loops together without changing the result of
the executed program. In Figure 16, two loops are merged into one loop, which
increases the number of instructions in each trip. Loop unrolling merges
consecutive trips together to reduce the total trip count; in this example, the
trip count is reduced from four to two when loop unrolling is performed. These
source-code transformations do not change the execution results, but they
increase the number of instructions located in each loop trip and thus the number
of instructions that can be executed simultaneously.
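The trip-count reduction from four to two can be sketched as follows (again a hypothetical loop, not the one in Figure 16):

/* Before unrolling: four trips, one element processed per trip. */
void add_one(float a[4])
{
    for (int i = 0; i < 4; i++)
        a[i] += 1.0f;
}

/* After unrolling by two: two trips, two statements per trip. The two
 * statements in one trip are independent of each other, so a VLIW or
 * superscalar processor can issue them in the same cycle. */
void add_one_unrolled(float a[4])
{
    for (int i = 0; i < 4; i += 2) {
        a[i]     += 1.0f;
        a[i + 1] += 1.0f;
    }
}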
Both loop fusion and loop unrolling increase the basic block size by merging
several basic blocks together. Loop fusion merges basic blocks in the code
domain, in that different code segments are merged, whereas loop unrolling merges
basic blocks in the time domain, in that different loop iterations are merged.
This step increases the code size of each loop trip. However, we do not observe
significant basic block