Multidimensional Clustering - Physical Database Design

Databases Reference

In-Depth Information

Using an SQL query we can improve on this by actually querying the number of

unique combinations of COLOR, YEAR, and REGION, which will almost always be

most accurate and a smaller number than the simple product of unique values of each

column. Why will the result be different? Correlation. The simple estimation method

based on individual column cardinality assumes dimension independence, namely that

there is no correlation between any of the dimensions, which is a worst-case scenario. In

more real-world systems, some degree of correlation is found between dimensions,

reducing the number of unique combinations. A discussion on counting and sampling

for MDC storage estimation can be found in Chapter 10.

8.6 Designing MDC Tables

8.6.1 Constraining the Storage Expansion Using Coarsification

In many cases a base column alone may have an unsuitably large cardinality of distinct

column values to make it useful as a candidate for defining cells in a cell-block MDC

storage structure, since each cell requires at least one storage block. A high cardinality of

cells can lead to a large number of partially filled blocks, resulting in storage expansion

and poor I/O characteristics. Salary is a typical example, where even in a large corpora-

tion most employees are likely to have slightly different salaries, and clustering by simi-

lar salaries may not be effective. By using expression-based columns we can derive math-

ematical and lexical models that are based on the base columns of a table, but which

have superior clustering potential across one or more dimensions. Again, using salary as

an example, INT(SALARY/1000) is likely to be superior in terms of clustering potential

to clustering directly on SALARY. Expression-based columns offer a mechanism to

express a mathematical model for defining clustering cells so that the domain of cells in

the cell-block model for MDC can be specified as an expression, rather than specifying

the domain of each cell explicitly.

Some popular RDBMS products available today provide the ability to define

expression-based columns as part of a relational table definition. These columns are

mathematical functions of columns within their record tuple. Currently, two of the

leading RDBMS vendors, IBM and Microsoft, support expression-based columns in

their product. IBM's DB2 supports this by a feature known as “Generated Columns,”

while Microsoft's SQL Server supports it through a feature known as “Computed Col-

umns.” For example, using the DB2 syntax one might define an expression-based col-

umn on an employee table that is a function of each employees' salary as follows:

CREATE TABLE EMPLOYEES (EMPLOYEE_ID INT,

SALARY DECIMAL(10,4),

SALARY_RANGE

GENERATED ALWAYS AS (SALARY/1000))

Search WWH ::

Custom Search

Home