Accelerating Queries on Very Large Datasets - Scientific Data Management

Database Reference

In-Depth Information

rx

12

E

10

G B

8

6

F

D

A

C

4

2

lx

2468 0

12

Figure 6.6

Shaded region representing an intersection query.

to a range query of finding all line segments whose lx

5as

depicted in Figure 6.5. Similarly, a query that asks for all lines that overlap

line F

≤

4

.

5 and rx

≥

4

.

=

2,6

also translates to a range query of finding all lines whose lx

≤

6

and rx

2. The shaded region of Figure 6.6 is the area where the points

of the response to query lie. In the illustrative example given, all lines are

mapped onto points that lie above the 45-degree line. It is possible to choose

a mapping that distributes the points uniformly in the embedding space. The

basic idea is easily extended to objects of arbitrary shapes and dimensions

such as rectangles, circles, spheres, and so forth. For example, a collection of

rectangles may be represented either explicitly as rectangles with the appro-

priate index structure, such as R-trees 94 layered over it to process queries on

them, or mapped as points in a four-dimensional data space. In the subse-

quent discussion, we focus mostly on datasets that are k -dimensional and are

perceived as points in k -dimensional space.

A sample of a two-dimensional dataset is given in the table shown in

Figure 6.7. The first three columns give the Y-values, the X-values, and the

labels, respectively. The mapping of the dataset as points in a two-dimensional

space is shown in Figure 6.8 and is based on the Y- and X-values. The indexing

mechanism used to access the data is highly dependent on the physical orga-

nization of the datasets. In high-dimensional, very large, scientific datasets,

the datasets are typically vertically partitioned and stored by columns. Such

physical storage of datasets is amenable to ecient access by bitmaps and

compressed bitmap indexes. 4 , 21 , 38 , 48 , 102 , 103 Another popular physical represen-

tation of datasets, particularly when the datasets are perceived as points in a

k-dimensional data space, is by tessellating the space into rectangular regions

and then representing each region as a chunk of data so that all the points

in a region are clustered in the same chunk . A chunk typically corresponds

to the physical page size of disk storage but may span more than one page.

The different methods of tessellating and indexing the corresponding chunks

≥

Scientific Data Management

Search WWH ::

Custom Search

Home