Index Structures - Approximate String Processing

Database Reference

In-Depth Information

und

e

nd

o

f

eed

d

o

und

od

ed

d

und

d

od

und

Fig. 5.3 The sux tree for the strings 'food', 'feed', 'fund', and 'found'.

instantiated in the worst case, as opposed to one node for every string

in the sux tree in the worst case.)

The size of the sux tree can be reduced even further by construct-

ing what is called a sux array. The strings in S are concatenated into

one large string (using a special terminating character # ∈ Σ for every

string) and the position of every sux is sorted in an array according

to the lexicographic ordering of the sux. The resulting sorted array

of positions can be searched using binary search on the sux pointed

to by each element in the array. The size of the sux array (since

it contains only sux positions) is very small. The drawback of the

su x array is that the data strings need to be readily available for

comparison purposes, resulting in accesses with poor locality of refer-

ence. Advanced algorithms for improving the eciency of sux arrays

have been proposed, but the structure eventually resembles an inverted

index built over all suxes in S .

Sux tries, sux trees, and sux arrays were designed to be used

primarily for main memory processing. Fairly recently, a variety of

alternatives have been specifically designed for external memory pro-

cessing. Cache-conscious alternatives have also been proposed.

5.2.3

B-trees

B-trees are used for indexing one-dimensional data in secondary stor-

age. A B-tree is a multiway tree that contains n values k 1

k n

that act as comparison keys and n + 1 pointers p 1 ,...,p n +1 , per node.

Pointer p i points to a child node that is the root of a sub-tree con-

taining all data values that lie in the interval ( k i− 1 ,k i ]. Data values

≤ ··· ≤

Approximate String Processing

Search WWH ::

Custom Search

Home