Multidimensional Clustering - Physical Database Design

Databases Reference

In-Depth Information

1.

Only a fraction of the database data can be cached in system memory. While the

exact ratio of cache size to storage size varies widely from system to system, ratios

of 1:10 or 1:5 are common.

2.

The unit of I/O and similarly the unit of caching within the database is a “page”

that represents some number of records. Page sizes are usually in the range 2 K

to 1 MB (with 4-16 K being typical), and record width is usually in the range

of 20-1,000 bytes, with ~150 bytes being a prototypical table design. There-

fore, one database storage page will typically store several dozen records.

The characteristics used for clustering depend on the data use. For example, pixels

in an image may be clustered by color density or by object boundaries. Data in a rela-

tional database is often intended to be clustered by significant query use. For example,

in a database where range queries on data are common it may be useful to cluster data

by date ranges. January records may be in one physical cluster, while February records

may be in another cluster. Figure 8.1 shows the same data as it might appear in either

clustered or unclustered form. The following three significant observations can be made

from this figure:

1.

A query on a given month of data would only need to access a single storage block

to obtain all of the query result data if the table data are clustered by month. The

same query on this example data, when the data is unclustered, would need to

access all four storage blocks. Thus, a query on January data in the clustered case

performs one-quarter of the I/O that the same query on unclustered data might.

Perhaps more importantly, if the pages are randomly distributed on disk, the

unclustered data is likely to incur four times as many disk seeks.

2.

Within a cluster, data are not sequenced. The cluster for January contains only

January data, but the records for each day of the month appear in an arbitrary

order within the storage block.

3.

A query over all data will access all four storage blocks in both the clustered and

unclustered data cases. In this example, the I/O overhead for table scans is not

affected by the cluster strategy for the data. However, later in this section we

will see important examples where this is not the case.

8.1.2 MDC

The performance of multidimensional queries (group-bys, range queries, etc.) is often

improved through data clustering, which can reduce I/O costs enormously and reduce

CPU costs. Clustering groups of records physically in storage is based on one or more

specified clustering dimensions. Yet the scheme for clustering, the definition of the

clustering keys, and the granularity of the clustering are nontrivial choices and can be

Search WWH ::

Custom Search

Home