Java Reference
In-Depth Information
stored somewhere for future processing as part of another run. However, because
the heap will shrink by one element in this case, there is now a free space where the
last element of the heap used to be! Thus, replacement selection will slowly shrink
the heap and at the same time use the discarded heap space to store records for the
next run. Once the first run is complete (i.e., the heap becomes empty), the array
will be filled with records ready to be processed for the second run.
Figure 8.8
illustrates part of a run being created by replacement selection.
It should be clear that the minimum length of a run will be M records if the size
of the heap is M, because at least those records originally in the heap will be part of
the run. Under good conditions (e.g., if the input is sorted), then an arbitrarily long
run is possible. In fact, the entire file could be processed as one run. If conditions
are bad (e.g., if the input is reverse sorted), then runs of only size M result.
What is the expected length of a run generated by replacement selection? It
can be deduced from an analogy called the snowplow argument. Imagine that a
snowplow is going around a circular track during a heavy, but steady, snowstorm.
After the plow has been around at least once, snow on the track must be as follows.
Immediately behind the plow, the track is empty because it was just plowed. The
greatest level of snow on the track is immediately in front of the plow, because
this is the place least recently plowed. At any instant, there is a certain amount of
snow S on the track. Snow is constantly falling throughout the track at a steady
rate, with some snow falling “in front” of the plow and some “behind” the plow.
(On a circular track, everything is actually “in front” of the plow, but Figure 8.9
illustrates the idea.) During the next revolution of the plow, all snow S on the track
is removed, plus half of what falls. Because everything is assumed to be in steady
state, after one revolution S snow is still on the track, so 2S snow must fall during
a revolution, and 2S snow is removed during a revolution (leaving S snow behind).
At the beginning of replacement selection, nearly all values coming from the
input file are greater (i.e., “in front of the plow”) than the latest key value output for
this run, because the run's initial key values should be small. As the run progresses,
the latest key value output becomes greater and so new key values coming from the
input file are more likely to be too small (i.e., “after the plow”); such records go to
the bottom of the array. The total length of the run is expected to be twice the size of
the array. Of course, this assumes that incoming key values are evenly distributed
within the key range (in terms of the snowplow analogy, we assume that snow falls
evenly throughout the track). Sorted and reverse sorted inputs do not meet this
expectation and so change the length of the run.
8.5.3 Multiway Merging
The second stage of a typical external sorting algorithm merges the runs created by
the first stage. Assume that we have R runs to merge. If a simple two-way merge
is used, then R runs (regardless of their sizes) will require log R passes through
Search WWH ::




Custom Search