Data Cube Technology - Data Mining: Concepts and Techniques

value (i.e., the value to be predicted). Expanding within these dimensions will likely

increase the sample size and not shift the query's answer. Consider an example of a 2-D

query specifying education = “college” and birth month = “July.” Let the cube measure

be average income. Intuitively, education has a high correlation to income while birth

month does not. It would be harmful to expand the education dimension to include values

such as “graduate” or “high school.” They are likely to alter the final result. However,

expansion in the birth month dimension to include other month values could be helpful,

because it is unlikely to change the result but will increase the sample size.

To mathematically measure the correlation of a dimension to the cube value, the

correlation between the dimension's values and their aggregated cube measures is computed.

Pearson's correlation coefficient for numeric data and the χ² correlation test for

nominal data are popularly used correlation measures, although many other measures,

such as covariance, can be used. (These measures were presented in Section 3.3.2.) A

dimension that is strongly correlated with the value to be predicted should not be a

candidate for expansion. Notice that since the correlation of a dimension with the cube

measure is independent of a particular query, it should be precomputed and stored with

the cube measure to facilitate efficient online analysis.
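
As a rough illustration of this precomputation, the following Python sketch aggregates a measure along one dimension and computes Pearson's correlation coefficient between the dimension's values and the aggregated averages. The sample rows, the names `aggregate_mean` and `dimension_correlation`, and the cutoff of 0.5 are all hypothetical. Birth month is treated as a numeric code purely to keep the sketch short; for a truly nominal dimension the χ² test mentioned above would be the more appropriate measure.

```python
import math
from collections import defaultdict


def aggregate_mean(rows, dim, measure):
    """Average `measure` for each distinct value of `dim` (a 1-D cuboid)."""
    sums = defaultdict(lambda: [0.0, 0])
    for row in rows:
        acc = sums[row[dim]]
        acc[0] += row[measure]
        acc[1] += 1
    return {value: total / count for value, (total, count) in sums.items()}


def pearson(xs, ys):
    """Pearson's correlation coefficient; returns 0.0 if either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def dimension_correlation(rows, dim, measure):
    """Correlate a dimension's values with their aggregated cube measure."""
    cells = aggregate_mean(rows, dim, measure)
    values = sorted(cells)
    return pearson(values, [cells[v] for v in values])


# Hypothetical sample: income (in thousands) depends on education, not on birth month.
base = {12: 30.0, 16: 50.0, 18: 70.0}            # years of education -> typical income
noise = {1: 1.0, 4: -1.0, 7: -1.0, 10: 1.0}      # birth month -> small fluctuation
rows = [{"education_years": e, "birth_month": m, "income": base[e] + noise[m]}
        for e in base for m in noise]

# Precompute once and store alongside the cube; the value is query independent.
stored = {dim: dimension_correlation(rows, dim, "income")
          for dim in ("education_years", "birth_month")}

# Assumed cutoff: only weakly correlated dimensions are expansion candidates.
expandable = [dim for dim, r in stored.items() if abs(r) < 0.5]
```

With this made-up sample, education is strongly correlated with income and is excluded, while birth month remains a candidate for expansion.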

After selecting dimensions for expansion, the next question is “Which values within

these dimensions should the expansion use?” This relies on the semantic knowledge of

the dimensions in question. The goal should be to select semantically similar values to

minimize the risk of altering the final result. Consider the age dimension: similarity

of values in this dimension is clear. There is a definite (numeric) order to the values.

Dimensions with numeric or ordinal (ranked) data (like education) have a definite

ordering among data values. Therefore, we can select values that are close to the instantiated

query value. For nominal data of a dimension that is organized in a multilevel

hierarchy in a data cube (e.g., location), we should select those values located in the

same branch of the tree (e.g., the same district or city).
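
The two selection rules above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the ordered dimension is given as a sorted list, the hierarchy is given as a child-to-parent mapping, and the function names, the `radius` parameter, and the sample values are hypothetical.

```python
def ordinal_neighbors(ordered_values, query_value, radius=1):
    """Values within `radius` positions of the query value in an ordered dimension."""
    i = ordered_values.index(query_value)
    lo = max(0, i - radius)
    return [v for v in ordered_values[lo:i + radius + 1] if v != query_value]


def hierarchy_siblings(parent_of, query_value):
    """Values that share the query value's parent in a concept hierarchy."""
    parent = parent_of[query_value]
    return sorted(v for v, p in parent_of.items()
                  if p == parent and v != query_value)


# Numeric dimension: expand age = 25 to nearby ages.
ages = list(range(18, 30))
nearby_ages = ordinal_neighbors(ages, 25, radius=2)

# Nominal dimension with a hierarchy (hypothetical city -> province mapping).
parent_of = {"Vancouver": "British Columbia",
             "Victoria": "British Columbia",
             "Toronto": "Ontario"}
same_branch = hierarchy_siblings(parent_of, "Vancouver")
```

A larger `radius` admits more samples at the cost of less similar values, so it plays the same role as the strictness setting of the statistical test discussed next.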

By considering additional data during query expansion, we are aiming for a more

accurate and reliable answer. As mentioned before, strongly correlated dimensions are

precluded from expansion for this purpose. An additional strategy is to ensure that

new samples share the “same” cube measure value (e.g., mean income) as the existing

samples in the query cell. The two-sample t-test is a relatively simple statistical

method that can be used to determine whether two samples have the same mean (or

any other point estimate), where “same” means that they do not differ significantly. (It

is described in greater detail in Section 8.5.5 on model selection using statistical tests of

significance.)

The test determines whether two samples have the same mean (the null hypothesis)

with the only assumption being that they are both normally distributed. The test fails,

that is, the null hypothesis is rejected, if there is significant evidence that the two samples do not share the same mean. Furthermore, the

test can be performed with a confidence level as an input. This allows the user to control

how strict or loose the query expansion will be.
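
A minimal sketch of such a check follows, using only the Python standard library. It assumes Welch's variant of the two-sample t-test (which does not require equal variances); the text does not say which variant is intended. The p-value is obtained by numerically integrating the t density, and the name `can_merge` and the sample incomes are hypothetical. Each sample needs at least two values and the two samples must not both have zero variance.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, integrating the density over [0, |t|] with Simpson's rule."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|)
    return max(0.0, 1.0 - central_mass)


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def can_merge(query_samples, candidate_samples, confidence=0.95):
    """True if the test does not reject equal means at the given confidence level."""
    t, df = welch_t(query_samples, candidate_samples)
    return two_sided_p(t, df) >= 1 - confidence


# Hypothetical incomes (in thousands) for the query cell and two candidate cells.
query_cell = [50, 52, 48, 51, 49]
similar_cell = [50, 51, 49, 52, 48]
different_cell = [80, 82, 78, 81, 79]

merge_similar = can_merge(query_cell, similar_cell)
merge_different = can_merge(query_cell, different_cell)
```

Raising `confidence` makes rejection harder, so more candidate cells are accepted and the expansion becomes looser; lowering it makes the expansion stricter. Where SciPy is available, `scipy.stats.ttest_ind` with `equal_var=False` computes the same statistic and p-value directly.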

Example 5.14 shows how the intracuboid expansion strategies just described can be

used to answer a query on sample data.
