Databases Reference
In-Depth Information
Using an SQL query we can improve on this by actually querying the number of
unique combinations of COLOR, YEAR, and REGION, which will almost always be
most accurate and a smaller number than the simple product of unique values of each
column. Why will the result be different? Correlation. The simple estimation method
based on individual column cardinality assumes dimension independence, namely that
there is no correlation between any of the dimensions, which is a worst-case scenario. In
more real-world systems, some degree of correlation is found between dimensions,
reducing the number of unique combinations. A discussion on counting and sampling
for MDC storage estimation can be found in Chapter 10.
8.6 Designing MDC Tables
8.6.1 Constraining the Storage Expansion Using Coarsification
In many cases a base column alone may have an unsuitably large cardinality of distinct
column values to make it useful as a candidate for defining cells in a cell-block MDC
storage structure, since each cell requires at least one storage block. A high cardinality of
cells can lead to a large number of partially filled blocks, resulting in storage expansion
and poor I/O characteristics. Salary is a typical example, where even in a large corpora-
tion most employees are likely to have slightly different salaries, and clustering by simi-
lar salaries may not be effective. By using expression-based columns we can derive math-
ematical and lexical models that are based on the base columns of a table, but which
have superior clustering potential across one or more dimensions. Again, using salary as
an example, INT(SALARY/1000) is likely to be superior in terms of clustering potential
to clustering directly on SALARY. Expression-based columns offer a mechanism to
express a mathematical model for defining clustering cells so that the domain of cells in
the cell-block model for MDC can be specified as an expression, rather than specifying
the domain of each cell explicitly.
Some popular RDBMS products available today provide the ability to define
expression-based columns as part of a relational table definition. These columns are
mathematical functions of columns within their record tuple. Currently, two of the
leading RDBMS vendors, IBM and Microsoft, support expression-based columns in
their product. IBM's DB2 supports this by a feature known as “Generated Columns,”
while Microsoft's SQL Server supports it through a feature known as “Computed Col-
umns.” For example, using the DB2 syntax one might define an expression-based col-
umn on an employee table that is a function of each employees' salary as follows:
CREATE TABLE EMPLOYEES (EMPLOYEE_ID INT,
SALARY DECIMAL(10,4),
SALARY_RANGE
GENERATED ALWAYS AS (SALARY/1000))
Search WWH ::




Custom Search