Accelerating Queries on Very Large Datasets - Scientific Data Management

Database Reference

In-Depth Information

These chunking methods give rise to typical formats for array files . Examples

of array files include the file formats known as NetCDF (network common data

format) 61 and HDF4/HDF5 (hierarchical data format). 42 We briefly explore

some of the widely used indexing methods that are based on tessellation of

the data space.

6.5.1 Multidimensional Tree-Structured Indexing Methods

From the general approach of designing multidimensional index schemes, we

examine now some special cases that lead to the class of tree-structured mul-

tidimensional indexing.

6.5.1.1 K-D-Tree and LSD-Tree

The K-D-Tree 11 , 25 , 33 was designed as a memory-resident data structure for

ecient processing of range, partial-match, and partial-range queries. Given

the k-dimensional data space of a collection of points, the K-D-Tree is derived

by tessellating the space with (k-1)-dimensional hyperplanes that are parallel

to all but the axis being split. The split plane occurs at the position that splits

the number of points as evenly as possible within the subspace being split.

The choice of which axis to split is done cyclically. The subspace tessellation

proceeds recursively until the number of points in a region forms the desired

chunk size of the data space. Figure 6.9a illustrates the idea for a 2D-space

tessellation that is represented explicitly as a tree structure in Figure 6.9b,

for a chunk size of 2. Such a K-D-Tree is referred to as a region K-D-Tree as

opposed to a point K-D-Tree. Rather than maintaining the leaf nodes, that

is, the data chunks, in memory, these can be maintained on disks while the

internal nodes of comparator values remain in memory. This variant of the

K-D-Tree is referred to as the LSD-Tree (or local split-decision tree) and is

more suitable for large disk-resident datasets. The idea is that when insertion

causes a data chunk to exceed its capacity, a decision can be made to split

the data chunk and to insert the split value as an internal node with pointers

to the newly created leaves of data chunks. The memory-resident tree nodes

can always be maintained persistently by ooading onto disk after a session

and reloading before a session.

The complexity of building a K-D-Tree of N points takes O

))

for a chunk size of B . Insertion and deletion of a new point into a K-

D-Tree takes O

(

N

/

B log 2 (

N

/

B

time. One key feature of the K-D-Tree is that

partial-match and partial-range queries involving s of k dimensions take

O

(

log

(

N

/

B

))

1

−

s

/

k

time to answer, where r is the number of the reported

points, and k the dimension of the K-D-Tree.

((

N

/

B

)

+

r

)

6.5.1.2 R-Tree and Its Variants

The R-Tree, first proposed by Guttman, 40 is an adaptation of the B + -Tree

to eciently index objects contained in k -dimensional bounding boxes. The

Scientific Data Management

Search WWH ::

Custom Search

Home