19.8 B-Trees
So far, we have assumed that we can store an entire data structure in the main
memory of a computer. Suppose, however, that we have more data than can fit
in main memory, and, as a result, we must have the data structure reside on
disk. When that happens, the rules of the game change, because the Big-Oh
model is no longer meaningful.
The problem is that a Big-Oh analysis assumes that all operations are
equal. However, that is not true, especially when disk I/O is involved. On the
one hand, a 500-MIPS machine supposedly executes 500 million instructions
per second. That is fairly fast, mainly because the speed depends largely on
electrical properties. On the other hand, a disk is mechanical. Its speed
depends largely on the time required to spin the disk and move a disk head.
Many disks spin at 7,200 RPM; that is 7,200 revolutions per minute, or 120
revolutions per second, so one revolution takes 1/120 of a second, or 8.3 ms.
On average we might expect to have to spin the disk halfway to find what we
are looking for, but this is compensated by the time to move the disk head, so
we get an access time of 8.3 ms. (This estimate is very charitable; 9- to 11-ms
access times are more common.) Consequently, we can do approximately 120 disk
accesses per second. This number of accesses sounds good until we compare it
with the processor speed: We have 500 million instructions versus 120 disk
accesses. Put another way, one disk access is worth about 4,000,000 instructions.
Of course, everything here is a rough calculation, but the relative speeds
are rather clear: Disk accesses are incredibly expensive. Furthermore, processor
speeds are increasing at a much faster rate than disk speeds (it is disk sizes
that are increasing quite quickly). Thus, we are willing to do lots of calculations
just to save a disk access. In almost all cases, the number of disk
accesses dominates the running time. By halving the number of disk accesses,
we can halve the running time.
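The arithmetic above is easy to check. The following sketch simply recomputes the estimates from the assumed figures of 500 MIPS and 7,200 RPM; the class name and printed labels are ours, chosen only for illustration.

// A rough check of the disk-versus-processor arithmetic above. The
// figures (500 MIPS, 7,200 RPM) are the values assumed in the text,
// not measurements of any particular machine.
public class DiskCostEstimate
{
    public static void main( String [ ] args )
    {
        final double INSTRUCTIONS_PER_SEC = 500000000.0;  // 500 MIPS
        final double RPM = 7200.0;

        double revolutionsPerSec   = RPM / 60.0;                  // 120
        double accessTimeMs        = 1000.0 / revolutionsPerSec;  // about 8.3 ms
        double accessesPerSec      = revolutionsPerSec;           // about 120
        double instructionsPerDisk = INSTRUCTIONS_PER_SEC / accessesPerSec;

        System.out.printf( "Access time: %.1f ms%n", accessTimeMs );
        System.out.printf( "Disk accesses per second: %.0f%n", accessesPerSec );
        System.out.printf( "Instructions per disk access: about %,.0f%n",
                           instructionsPerDisk );
    }
}

Running the sketch reproduces the numbers used in the discussion: an access time of about 8.3 ms, roughly 120 accesses per second, and about 4,000,000 instructions per disk access.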
Here is how the typical search tree performs on disk. Suppose that we want to
access the driving records for citizens in the State of Florida. We assume that we
have 10,000,000 items, that each key is 32 bytes (representing a name), and that a
record is 256 bytes. We assume that this data set does not fit in main memory and
that we are 1 of 20 users on a system (so we have 1/20 of the resources). Thus in
1 sec. we can execute 25 million instructions or perform six disk accesses.
The unbalanced binary search tree is a disaster. In the worst case, it has
linear depth and thus could require 10,000,000 disk accesses. On average a
successful search would require 1.38 log N disk accesses, and as log 10,000,000
is approximately 24, an average search would require 32 disk accesses, or 5 sec.
In a typical randomly constructed tree, we would expect that a few nodes are
three times deeper; they would require about 100 disk accesses, or 16 sec.
A red-black tree is somewhat better: The worst case of 1.44 log N is unlikely to
occur, but even a typical depth of roughly log N still means about 25 disk
accesses, or 4 sec.
When data are too large to fit in memory, the number of disk accesses becomes
important. A disk access is unbelievably expensive compared to a typical
computer instruction.

Even logarithmic performance is unacceptable. We need to perform searches in
three or four accesses. Updates can take slightly longer.
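For concreteness, the search-tree estimates can be reproduced with the same kind of back-of-the-envelope computation. The sketch below uses the stated assumptions (10,000,000 items, a 1/20 share of roughly 120 disk accesses per second, and the depth factors 1.38 and 1.44 quoted above); the class and method names are ours.

// A rough reproduction of the search-tree estimates above, under the
// stated assumptions: 10,000,000 items and a 1/20 share of roughly
// 120 disk accesses per second (about 6 accesses per second for us).
public class TreeSearchEstimate
{
    public static void main( String [ ] args )
    {
        final double N = 10000000.0;
        final double ACCESSES_PER_SEC = 120.0 / 20.0;   // our share: 6

        double logN = Math.log( N ) / Math.log( 2.0 );  // about 23.3

        double avgUnbalanced  = 1.38 * logN;            // about 32 accesses
        double deepUnbalanced = 3.0 * avgUnbalanced;    // deep nodes: about 100
        double worstRedBlack  = 1.44 * logN;            // red-black worst case

        report( "Average unbalanced search", avgUnbalanced, ACCESSES_PER_SEC );
        report( "Deep unbalanced nodes", deepUnbalanced, ACCESSES_PER_SEC );
        report( "Red-black worst case", worstRedBlack, ACCESSES_PER_SEC );
    }

    private static void report( String what, double accesses, double perSec )
    {
        System.out.printf( "%s: about %.0f disk accesses (%.0f sec)%n",
                           what, accesses, accesses / perSec );
    }
}

Even the best of these figures is far from the three or four accesses we are aiming for.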