19.8 B-Trees
So far, we have assumed that we can store an entire data structure in the main
memory of a computer. Suppose, however, that we have more data than can fit
in main memory, and, as a result, we must have the data structure reside on
disk. When that happens, the rules of the game change, because the Big-Oh
model is no longer meaningful.
The problem is that a Big-Oh analysis assumes that all operations are
equal. However, that is not true, especially when disk I/O is involved. On the
one hand, a 500-MIPS machine supposedly executes 500 million instructions
per second. That is fairly fast, mainly because the speed depends largely on
electrical properties. On the other hand, a disk is mechanical. Its speed
depends largely on the time required to spin the disk and move a disk head.
Many disks spin at 7,200 RPM; that is 7,200 revolutions per minute, or 120
revolutions per second, so one revolution takes 1/120 of a second, or 8.3 ms.
On average we might expect to have to spin the disk halfway to find what we
are looking for, but this is compensated by the time to move the disk head, so
we get an access time of 8.3 ms. (This estimate is very charitable; 9- to 11-ms
access times are more common.) Consequently, we can do approximately 120 disk
accesses per second. This number of accesses sounds good until we compare it
with the processor speed: We have 500 million instructions versus 120 disk
accesses. Put another way, one disk access is worth about 4,000,000 instructions.
Of course, everything here is a rough calculation, but the relative speeds
are rather clear: Disk accesses are incredibly expensive. Furthermore, processor
speeds are increasing at a much faster rate than disk speeds (it is disk sizes
that are increasing quite quickly). Thus, we are willing to do lots of calculations
just to save a disk access. In almost all cases, the number of disk
accesses dominates the running time. By halving the number of disk accesses,
we can halve the running time.
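The arithmetic above is easy to check. The following sketch simply recomputes the estimates from the assumed figures of 500 MIPS and 7,200 RPM; the class name and printed labels are ours, chosen only for illustration.

// A rough check of the disk-versus-processor arithmetic above. The
// figures (500 MIPS, 7,200 RPM) are the values assumed in the text,
// not measurements of any particular machine.
public class DiskCostEstimate
{
    public static void main( String [ ] args )
    {
        final double INSTRUCTIONS_PER_SEC = 500000000.0;  // 500 MIPS
        final double RPM = 7200.0;

        double revolutionsPerSec   = RPM / 60.0;                  // 120
        double accessTimeMs        = 1000.0 / revolutionsPerSec;  // about 8.3 ms
        double accessesPerSec      = revolutionsPerSec;           // about 120
        double instructionsPerDisk = INSTRUCTIONS_PER_SEC / accessesPerSec;

        System.out.printf( "Access time: %.1f ms%n", accessTimeMs );
        System.out.printf( "Disk accesses per second: %.0f%n", accessesPerSec );
        System.out.printf( "Instructions per disk access: about %,.0f%n",
                           instructionsPerDisk );
    }
}

Running the sketch reproduces the numbers used in the discussion: an access time of about 8.3 ms, roughly 120 accesses per second, and about 4,000,000 instructions per disk access.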
Here is how the typical search tree performs on disk. Suppose that we want to
access the driving records for citizens in the State of Florida. We assume that we
have 10,000,000 items, that each key is 32 bytes (representing a name), and that a
record is 256 bytes. We assume that this data set does not fit in main memory and
that we are 1 of 20 users on a system (so we have 1/20 of the resources). Thus in
1 sec. we can execute 25 million instructions or perform six disk accesses.
The unbalanced binary search tree is a disaster. In the worst case, it has
linear depth and thus could require 10,000,000 disk accesses. On average a
successful search would require 1.38 log N disk accesses, and as log 10,000,000
is approximately 24, an average search would require 32 disk accesses, or 5 sec.
In a typical randomly constructed tree, we would expect that a few nodes are
three times deeper; they would require about 100 disk accesses, or 16 sec.
A red-black tree is somewhat better: The worst case of 1.44 log N is unlikely to
occur, but even a typical depth of roughly log N still means about 25 disk
accesses, or 4 sec.
When data are too large to fit in memory, the number of disk accesses becomes
important. A disk access is unbelievably expensive compared to a typical
computer instruction.

Even logarithmic performance is unacceptable. We need to perform searches in
three or four accesses. Updates can take slightly longer.
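For concreteness, the search-tree estimates can be reproduced with the same kind of back-of-the-envelope computation. The sketch below uses the stated assumptions (10,000,000 items, a 1/20 share of roughly 120 disk accesses per second, and the depth factors 1.38 and 1.44 quoted above); the class and method names are ours.

// A rough reproduction of the search-tree estimates above, under the
// stated assumptions: 10,000,000 items and a 1/20 share of roughly
// 120 disk accesses per second (about 6 accesses per second for us).
public class TreeSearchEstimate
{
    public static void main( String [ ] args )
    {
        final double N = 10000000.0;
        final double ACCESSES_PER_SEC = 120.0 / 20.0;   // our share: 6

        double logN = Math.log( N ) / Math.log( 2.0 );  // about 23.3

        double avgUnbalanced  = 1.38 * logN;            // about 32 accesses
        double deepUnbalanced = 3.0 * avgUnbalanced;    // deep nodes: about 100
        double worstRedBlack  = 1.44 * logN;            // red-black worst case

        report( "Average unbalanced search", avgUnbalanced, ACCESSES_PER_SEC );
        report( "Deep unbalanced nodes", deepUnbalanced, ACCESSES_PER_SEC );
        report( "Red-black worst case", worstRedBlack, ACCESSES_PER_SEC );
    }

    private static void report( String what, double accesses, double perSec )
    {
        System.out.printf( "%s: about %.0f disk accesses (%.0f sec)%n",
                           what, accesses, accesses / perSec );
    }
}

Even the best of these figures is far from the three or four accesses we are aiming for.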