customer depended greatly on the customer's gender?” Notice that he believes time and
location play a role in predicting valued customers, but at what granularity levels do
they depend on gender for this task? For example, is performing analysis using {month,
country} better than {year, state}?
Consider a data table D (e.g., the customer table). Let X be the set of attributes for
which no concept hierarchy has been defined (e.g., gender, salary). Let Y be the class-
label attribute (e.g., valued customer), and Z be the set of multilevel attributes, that is,
attributes for which concept hierarchies have been defined (e.g., time, location). Let V
be the set of attributes for which we would like to define their predictiveness. In our
example, this set is {gender}. The predictiveness of V on a data subset can be quantified
by the difference in accuracy between the model built on that subset using X to predict Y
and the model built on that subset using X − V (e.g., {salary}) to predict Y. The intuition
is that, if the difference is large, V must play an important role in the prediction of class
label Y.
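
As a rough illustration of this definition, the following Python sketch computes the predictiveness of V = {gender} on a single data subset as the accuracy difference described above. The column names (gender, salary, valued_customer), the choice of a decision tree learner, and the single held-out split are illustrative assumptions, not part of the prediction cube framework itself; categorical attributes are assumed to be numerically encoded.

    # Sketch: predictiveness of V on one data subset as an accuracy difference.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def predictiveness(subset, X_attrs, V_attrs, y_attr):
        """Accuracy of the model built on X minus accuracy of the model built on X - V."""
        X_minus_V = [a for a in X_attrs if a not in V_attrs]
        train, test = train_test_split(subset, test_size=0.3, random_state=0)

        def held_out_accuracy(attrs):
            clf = DecisionTreeClassifier().fit(train[attrs], train[y_attr])
            return accuracy_score(test[y_attr], clf.predict(test[attrs]))

        return held_out_accuracy(X_attrs) - held_out_accuracy(X_minus_V)

    # e.g., predictiveness(cell_rows, ["gender", "salary"], ["gender"], "valued_customer")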
Given a set of attributes, V, and a learning algorithm, the prediction cube at granularity
⟨l_1, ..., l_d⟩ (e.g., ⟨year, state⟩) is a d-dimensional array, in which the value in each cell
(e.g., [2010, Illinois]) is the predictiveness of V evaluated on the subset defined by the
cell (e.g., the records in the customer table with time in 2010 and location in Illinois).
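
Continuing the sketch above, a prediction cube at granularity ⟨year, state⟩ could be materialized by grouping the table on those levels of the multilevel attributes and applying the predictiveness function to each cell; this is exactly the exhaustive approach discussed below. The customers DataFrame, the year/state columns (time and location already rolled up to those levels), and the dictionary representation of the cube are assumptions for illustration.

    # Sketch: a prediction cube at granularity <year, state>, stored as a dict keyed by cell.
    def prediction_cube(table, X_attrs, V_attrs, y_attr, levels=("year", "state")):
        cube = {}
        for cell, cell_rows in table.groupby(list(levels)):
            cube[cell] = predictiveness(cell_rows, X_attrs, V_attrs, y_attr)
        return cube

    # customers is a hypothetical pandas DataFrame holding the customer table;
    # cube[(2010, "Illinois")] is then the predictiveness of {gender} on that cell's records.
    cube = prediction_cube(customers, ["gender", "salary"], ["gender"], "valued_customer")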
...
Supporting OLAP roll-up and drill-down operations on a prediction cube is a
computational challenge requiring the materialization of cell values at many different
granularities. For simplicity, we can consider only full materialization. A naïve way to
fully materialize a prediction cube is to exhaustively build models and evaluate them for
each cell and granularity. This method is very expensive if the base data set is large.
An ensemble method called Probability-Based Ensemble (PBE) was developed as a
more feasible alternative. It requires model construction for only the finest-grained
cells. OLAP-style bottom-up aggregation is then used to generate the values of the
coarser-grained cells.
The prediction of a predictive model can be seen as finding a class label that maxi-
mizes a scoring function. The PBE method was developed to approximately make the
scoring function of any predictive model distributively decomposable. In our discus-
sion of data cube measures in Section 4.2.4, we showed that distributive and algebraic
measures can be computed efficiently. Therefore, if the scoring function used is dis-
tributively or algebraically decomposable, prediction cubes can also be computed with
efficiency. In this way, the PBE method reduces prediction cube computation to data
cube computation.
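
For instance, suppose a classifier's score depends only on class counts and per-attribute-value class counts. Those counts are distributive measures, so a coarser cell's counts are simply the sums of the counts of its finest-grained cells, and the score is an algebraic function of those sums. The sketch below illustrates this idea with naïve-Bayes-style counts; the count layout and function names are assumptions for illustration, not the PBE algorithm itself, and Laplace smoothing is omitted for brevity.

    # Sketch: counts are distributive, so a coarser cell's classifier can be assembled by
    # summing the counts stored at its finest-grained cells, with no retraining on raw data.
    from collections import Counter

    def merge_counts(finest_cells):
        """Each element: {"class": Counter({label: n}), "attr": Counter({(attr, value, label): n})}."""
        merged = {"class": Counter(), "attr": Counter()}
        for c in finest_cells:
            merged["class"] += c["class"]
            merged["attr"] += c["attr"]
        return merged

    def nb_score(counts, x, label):
        """Unnormalized naive Bayes score P(label) * prod_a P(x[a] | label), from counts alone."""
        total = sum(counts["class"].values())
        score = counts["class"][label] / total
        for attr, value in x.items():
            score *= counts["attr"][(attr, value, label)] / counts["class"][label]
        return score

    # The coarser cell [2010, Illinois] merges the counts of all finest-grained cells
    # (e.g., <month, city> cells) that roll up to it; its prediction for a record x is
    # the label maximizing nb_score(merged_counts, x, label).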
For example, previous studies have shown that the naïve Bayes classifier has an alge-
braically decomposable scoring function, and the kernel density-based classifier has a
distributively decomposable scoring function.⁸ Therefore, either of these could be used
to implement prediction cubes efficiently.
⁸ Naïve Bayes classifiers are detailed in Chapter 8. Kernel density-based classifiers, such as support vector
machines, are described in Chapter 9.
 