Accelerating Queries on Very Large Datasets - Scientific Data Management

Database Reference

In-Depth Information

is not space ecient. A more space-ecient implementation of a trie index

is the PATRICIA index. 47 , 58 PATRICIA stands for practical algorithm to

retrieve information coded in alphanumeric. Tries are the fundamental index

schemes for string-oriented databases, for example, very large text databases

used in information-retrieval problems, genome sequence databases, and bio-

informatics. Tries can be perceived as generic index methods with variants

such as Su x-Trees , String B-Trees , and Burst Tries . 31 , 37 , 39 , 43 , 47 , 99

6.4.3 Hashing Schemes

Hashing methods are mechanisms for accessing records by the address of a

record. The address is computed using a key of the record, usually the primary

key or some unique combination of attribute values of the record. The method

involves a set of n disk blocks, also termed buckets , with index values numbered

from 0 to n

1. A bucket is a unit of storage containing one or more records.

For each record

−

k i , r i

whose key is k i and entire record is r i , a hash function

H

()

is used to compute the bucket address I where the record is stored, that

is, I

. The result I of computing the hash function forms the index

into the array of buckets or data blocks that hold the records. A hashing

scheme can either be static or dynamic .

In static hashing, the number of buckets is preallocated. Records with dif-

ferent search-key values may map to the same bucket, in which case we say

that a collision occurs. To locate a record in a bucket with multiple keys, the

entire bucket has to be searched sequentially. Occasionally a bucket may have

more records that hash into it than it can contain. When this occurs it is

handled by invoking some overflow handling methods. Typical overflow han-

dling methods include separate chaining, rehashing, coalesced chaining, and

so forth. To minimize the occurrence of overflow, a hash function has to be

chosen to distribute the keys uniformly over the allocated buckets. Hashing

methods on single attributes are discussed extensively in the literature. 25 , 47 , 79

An alternative to static hashing is dynamic hashing. Dynamic hashing uses

a dynamically changing function that allows the addressed space, that is, the

number of allocated buckets, to grow and shrink with insertions and dele-

tions of the records, respectively. It embeds the handling of overflow records

dynamically into its primary address space to avoid explicit management of

overflow buckets. Various techniques for dynamic hashing have been proposed.

Notable among these are dynamic hashing, 51 linear hashing, 54 and extendible

hashing. 30 , 68

=

H

(

k i )

6.5 Multidimensional Index Schemes

Multidimensional indexing arises from the need for ecient query processing

in numerous application domains where objects are generally characterized

by feature vectors . They arise naturally from representation and querying of

Scientific Data Management

Search WWH ::

Custom Search

Home