Data Cube Technology - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

age-occupation can use its parent in either of the 1-D cuboids, age or occupation . Think

of intercuboid expansion as just an extreme case of intracuboid expansion, where all the

cells within a dimension are used in the expansion. This essentially sets the dimension

to and thus generalizes to a higher-level cuboid.

A k -dimensional cell has k direct parents in the cuboid lattice, where each parent is

( k 1)-dimensional. There are many more ancestor cells in the data cube (e.g., if mul-

tiple dimensions are rolled up simultaneously). However, we choose only one parent

here to make the search space tractable and to limit the change in the query's semantics.

As with intracuboid query expansion, correlated dimensions are not allowed in inter-

cuboid expansions. Within the uncorrelated dimensions, the two-sample t -test can be

performed to confirm that the parent and the query cell share the same sample mean. If

multiple parent cells pass the test, the test's confidence level can be adjusted progressively

higher until only one passes. Alternatively, multiple parent cells can be used to boost the

confidence simultaneously. The choice is application dependent.

Example 5.15 Intercuboid expansion to answer a query on sample data. Given the input relation in

Table 5.10, let the query on income be “ occupation Dteacher^ gender Dmale.” There is

only one sample in Table 5.10 that matches the query, and it has an income of $80,000.

Suppose the corresponding confidence interval is larger than a preset threshold. We use

intercuboid expansion to find a more reliable answer. There are two parent cells in the

data cube: “ gender D male” and “ occupation D teacher.” By moving up to “ gender D

male” (and thus setting occupation to ), the mean income is $101,000. A two sample

t -test reveals that this parent's sample mean differs significantly from that of the original

query cell, so it is ignored. Next, “ occupation D teacher” is considered. It has a mean

income of $85,000 and passes the two-sample t -test. As a result, the query is expanded

to “ occupation D teacher” and an income value of $85,000 is returned with acceptable

reliability.

“ How can we determine which method to choose—intracuboid expansion or intercuboid

expansion? ” This is difficult to answer without knowing the data and the application. A

strategy for choosing between the two is to consider what the tolerance is for change

in the query's semantics. This depends on the specific dimensions chosen in the query.

For instance, the user might tolerate a bigger change in semantics for the age dimension

than education . The difference in tolerance could be so large that the user is willing to set

age to(i.e., intercuboid expansion) rather than letting education change at all. Domain

knowledge is helpful here.

So far, our discussion has only focused on full materialization of the sampling cube.

In many real-world problems, this is often impossible, especially for high-dimensional

cases. Real-world survey data, for example, can easily contain over 50 variables (i.e.,

dimensions). The sampling cube size would grow exponentially with the number of

dimensions. To handle high-dimensional data, a sampling cube method called Sampling

Cube Shell was developed. It integrates the Frag-Shell method of Section 5.2.4 with the

query expansion approach. The shell computes only a subset of the full sampling cube.

Search WWH ::

Custom Search

Home