Indexing - Data Structures and Algorithm Analysis

Java Reference

In-Depth Information

neighboring sibling, if possible. If the sibling is also full, then these two nodes split

into three. Similarly, when a node underflows, it is combined with its two siblings,

and the total reduced to two nodes. Thus, the nodes are always at least two thirds

full. 2

10.5.2

B-Tree Analysis

The asymptotic cost of search, insertion, and deletion of records from B-trees,

B + -trees, and B -trees is (log n) where n is the total number of records in the

tree. However, the base of the log is the (average) branching factor of the tree.

Typical database applications use extremely high branching factors, perhaps 100 or

more. Thus, in practice the B-tree and its variants are extremely shallow.

As an illustration, consider a B + -tree of order 100 and leaf nodes that contain

up to 100 records. A B + -tree with height one (that is, just a single leaf node) can

have at most 100 records. A B + -tree with height two (a root internal node whose

children are leaves) must have at least 100 records (2 leaves with 50 records each).

It has at most 10,000 records (100 leaves with 100 records each). A B + -tree with

height three must have at least 5000 records (two second-level nodes with 50 chil-

dren containing 50 records each) and at most one million records (100 second-level

nodes with 100 full children each). A B + -tree with height four must have at least

250,000 records and at most 100 million records. Thus, it would require an ex-

tremely large database to generate a B + -tree of more than height four.

The B + -tree split and insert rules guarantee that every node (except perhaps the

root) is at least half full. So they are on average about 3=4 full. But the internal

nodes are purely overhead, since the keys stored there are used only by the tree to

direct search, rather than store actual data. Does this overhead amount to a signifi-

cant use of space? No, because once again the high fan-out rate of the tree structure

means that the vast majority of nodes are leaf nodes. Recall (from Section 6.4) that

a full K-ary tree has approximately 1=K of its nodes as internal nodes. This means

that while half of a full binary tree's nodes are internal nodes, in a B + -tree of order

100 probably only about 1=75 of its nodes are internal nodes. This means that the

overhead associated with internal nodes is very low.

We can reduce the number of disk fetches required for the B-tree even more

by using the following methods. First, the upper levels of the tree can be stored in

main memory at all times. Because the tree branches so quickly, the top two levels

(levels 0 and 1) require relatively little space. If the B-tree is only height four, then

2 This concept can be extended further if higher space utilization is required. However, the update

routines become much more complicated. I once worked on a project where we implemented 3-for-4

node split and merge routines. This gave better performance than the 2-for-3 node split and merge

routines of the B -tree. However, the spitting and merging routines were so complicated that even

their author could no longer understand them once they were completed!

Search WWH ::

Custom Search

Home