Data Analytics: Exploiting the Data Warehouse - Data Warehouse Systems: Design and Implementation - page 328


Classification methods were borrowed from statistics and machine learning.
The most popular methods are the ones based on decision trees. A decision
tree has three types of nodes: a root node, with no incoming edges and zero
or more outgoing edges; internal nodes, with exactly one incoming edge and
two or more outgoing edges; and leaf or terminal nodes, with exactly one
incoming edge and no outgoing edges. In a decision tree, each terminal node
is assigned a class label. Nonterminal nodes contain attribute test
conditions to split records with different characteristics from each other.
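The node types described above can be sketched as a small data structure. This is only an illustrative sketch, not the book's implementation: the class names Leaf and Test, the function classify, and the toy tree (attributes "a" and "b", their thresholds, and the labels "X" and "Y") are all invented for the example. For brevity, every test here is binary, although a nonterminal node may have more than two outgoing edges.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Record = Dict[str, float]


@dataclass
class Leaf:
    """Terminal node: no outgoing edges, carries a class label."""
    label: str


@dataclass
class Test:
    """Root or internal node: carries an attribute test condition.

    `condition` evaluates a record; its result selects the outgoing edge.
    """
    condition: Callable[[Record], bool]
    children: Dict[bool, "Node"]


Node = Union[Leaf, Test]


def classify(node: Node, record: Record) -> str:
    """Walk from the root to a leaf, evaluating one test per nonterminal node."""
    while isinstance(node, Test):
        node = node.children[node.condition(record)]
    return node.label


# Hypothetical tree: attribute names, thresholds, and labels are made up.
toy_tree = Test(
    condition=lambda r: r["a"] <= 10,
    children={
        True: Leaf("X"),
        False: Test(
            condition=lambda r: r["b"] <= 5,
            children={True: Leaf("Y"), False: Leaf("X")},
        ),
    },
)

print(classify(toy_tree, {"a": 12, "b": 3}))
```

Keeping the test condition as a function on the node means that classifying a record is just a loop that follows one edge per node until it reaches a leaf.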

For example, in the Northwind case study, we want to generate a very simple
classification of customers, consisting of just two classes: good or bad
customers, identified as 'G' and 'B', respectively. For this, we use two
demographic characteristics: the year the business was established and the
annual profit. To represent the first characteristic, we use the attribute
YearEstablished. For the second, and to keep the example simple at this
stage, we use a continuous attribute called AnnualProfitCont. Recall that in
Sect. 9.1 the attribute AnnualProfit was categorized into six classes. For
the current example, we use the actual continuous values. Later in this
chapter, we will show an example using discrete attributes when we present
the Analysis Services data mining tools. Intuitively, to be classified as
'G', a customer established a long time ago (say, 20 years) needs a smaller
profit than a customer established more recently. We will see below how
this classification is produced.

YearEstablished   AnnualProfitCont   Class
1977              1,000,000          G
1961                500,000          B
1978              1,300,000          B
1985              1,200,000          G
1995              1,400,000          B
1975              1,100,000          G

In this example, we can use the YearEstablished attribute to separate the
records first and then, in a second step, use the AnnualProfitCont attribute
for a finer classification within each group of customers with a similar
number of years in the market. The intuition behind this is that the
attribute YearEstablished conveys more information about the record than the
attribute AnnualProfitCont.
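The two-step separation can be tried out on the six training records from the table. The following is a minimal sketch under stated assumptions: the function name two_step_partition is hypothetical, and the cut points 1977 and 1,000,000 are picked for illustration only, not derived by any induction algorithm. It groups the records by a test on YearEstablished first, then by a test on AnnualProfitCont, and counts the class labels in each group.

```python
from collections import Counter

# Training records from the table: (YearEstablished, AnnualProfitCont, Class).
training = [
    (1977, 1_000_000, "G"),
    (1961, 500_000, "B"),
    (1978, 1_300_000, "B"),
    (1985, 1_200_000, "G"),
    (1995, 1_400_000, "B"),
    (1975, 1_100_000, "G"),
]

# Assumed cut points, chosen for illustration only.
YEAR_CUT = 1977
PROFIT_CUT = 1_000_000


def two_step_partition(records, year_cut, profit_cut):
    """Group records by the year test first, then by the profit test.

    Returns a dict mapping (year <= year_cut, profit <= profit_cut)
    to a Counter of the class labels that fall in that group.
    """
    groups = {}
    for year, profit, label in records:
        key = (year <= year_cut, profit <= profit_cut)
        groups.setdefault(key, Counter())[label] += 1
    return groups


groups = two_step_partition(training, YEAR_CUT, PROFIT_CUT)
for key, counts in sorted(groups.items()):
    print(key, dict(counts))
```

With hand-picked cut points the resulting groups need not contain a single class each, so the counts per group give a quick check of how well a candidate pair of tests separates 'G' from 'B'.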

Once the model has been built, classifying a test record is straightforward,
as this is done by traversing the tree and evaluating the conditions at each
node. For example, we can build a tree like the one in Fig. 9.2, based on
the training data. Then, if a record with YearEstablished = 1995 and
AnnualProfitCont = 1,200,000 arrives, it will be classified as 'G',
following the path: YearEstablished ≤ 1977 = false, AnnualProfitCont ≤
1,000,000 = false. Again, the rationale here is that even if the customer
has established
