Databases Reference
In-Depth Information
1.
Only a fraction of the database data can be cached in system memory. While the
exact ratio of cache size to storage size varies widely from system to system, ratios
of 1:10 or 1:5 are common.
2.
The unit of I/O and similarly the unit of caching within the database is a “page”
that represents some number of records. Page sizes are usually in the range 2 K
to 1 MB (with 4-16 K being typical), and record width is usually in the range
of 20-1,000 bytes, with ~150 bytes being a prototypical table design. There-
fore, one database storage page will typically store several dozen records.
The characteristics used for clustering depend on the data use. For example, pixels
in an image may be clustered by color density or by object boundaries. Data in a rela-
tional database is often intended to be clustered by significant query use. For example,
in a database where range queries on data are common it may be useful to cluster data
by date ranges. January records may be in one physical cluster, while February records
may be in another cluster. Figure 8.1 shows the same data as it might appear in either
clustered or unclustered form. The following three significant observations can be made
from this figure:
1.
A query on a given month of data would only need to access a single storage block
to obtain all of the query result data if the table data are clustered by month. The
same query on this example data, when the data is unclustered, would need to
access all four storage blocks. Thus, a query on January data in the clustered case
performs one-quarter of the I/O that the same query on unclustered data might.
Perhaps more importantly, if the pages are randomly distributed on disk, the
unclustered data is likely to incur four times as many disk seeks.
2.
Within a cluster, data are not sequenced. The cluster for January contains only
January data, but the records for each day of the month appear in an arbitrary
order within the storage block.
3.
A query over all data will access all four storage blocks in both the clustered and
unclustered data cases. In this example, the I/O overhead for table scans is not
affected by the cluster strategy for the data. However, later in this section we
will see important examples where this is not the case.
8.1.2 MDC
The performance of multidimensional queries (group-bys, range queries, etc.) is often
improved through data clustering, which can reduce I/O costs enormously and reduce
CPU costs. Clustering groups of records physically in storage is based on one or more
specified clustering dimensions. Yet the scheme for clustering, the definition of the
clustering keys, and the granularity of the clustering are nontrivial choices and can be
Search WWH ::




Custom Search