Classification methods were borrowed from statistics and machine learning.
The most popular methods are the ones based on decision trees. A
decision tree has three types of nodes: a root node, with no incoming edges and
zero or more outgoing edges; internal nodes, with exactly one incoming edge
and two or more outgoing edges; and leaf or terminal nodes, with exactly one
incoming edge and no outgoing edges. In a decision tree, each terminal node
is assigned a class label. Nonterminal nodes contain attribute test conditions
to split records with different characteristics from each other.
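The node types described above can be sketched in Python as follows. This is only an illustrative sketch: the class names Leaf and Test, the function classify, and the thresholds and leaf labels of the sample tree are all assumptions made for the example and are not taken from the book's figures. It also assumes binary tests of the form attribute <= threshold, although the definition above allows more than two outgoing edges.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """Terminal node: carries a class label and has no outgoing edges."""
    label: str


@dataclass
class Test:
    """Root or internal node: an attribute test with one child per outcome."""
    attribute: str
    threshold: float
    if_true: "Node"   # followed when record[attribute] <= threshold
    if_false: "Node"  # followed otherwise


Node = Union[Leaf, Test]


def classify(node: Node, record: Dict[str, float]) -> str:
    """Walk from the given node down to a leaf and return its class label."""
    while isinstance(node, Test):
        if record[node.attribute] <= node.threshold:
            node = node.if_true
        else:
            node = node.if_false
    return node.label


# Hypothetical tree (invented thresholds): older customers need a
# smaller profit than recent ones to be labeled 'G'.
tree = Test(
    "YearEstablished", 1980,
    if_true=Test("AnnualProfitCont", 900_000, Leaf("B"), Leaf("G")),
    if_false=Test("AnnualProfitCont", 1_500_000, Leaf("B"), Leaf("G")),
)

older = {"YearEstablished": 1970, "AnnualProfitCont": 1_000_000}
newer = {"YearEstablished": 1990, "AnnualProfitCont": 1_000_000}
print(classify(tree, older), classify(tree, newer))
```

In this sketch, two customers with the same profit end up in different classes only because of the year in which they were established, since each test node sends a record down exactly one of its outgoing edges.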
For example, in the Northwind case study, we want to generate a very
simple classification of customers, consisting of just two classes: good or
bad customers, identified as 'G' and 'B', respectively. For this, we use two
demographic characteristics: the year the business was established and the
annual profit. To represent the first characteristic, we use the attribute
YearEstablished. For the second, and to keep the example simple at this stage,
we use a continuous attribute called AnnualProfitCont. Recall that in Sect. 9.1
the attribute AnnualProfit was categorized into six classes. For the
current example, we use the actual continuous values. Later in this chapter,
we will show an example using discrete attributes when we present Analysis
Services data mining tools. Intuitively, to be classified as 'G', a customer
established a long time ago (say, 20 years) requires a smaller profit than the
profit required of a customer established more recently. We will see below
how this classification is produced.
YearEstablished   AnnualProfitCont   Class
1977              1,000,000          G
1961                500,000          B
1978              1,300,000          B
1985              1,200,000          G
1995              1,400,000          B
1975              1,100,000          G
In this example, we can use the YearEstablished attribute to separate records
first, and, in a second step, we can use the AnnualProfitCont attribute for
a finer classification within the class of customers with a similar number
of years in the market. The intuition behind this is that the attribute
YearEstablished conveys more information about the record than the attribute
AnnualProfitCont.
Once the model has been built, classifying a test record is straightforward,
as this is done by traversing the tree and evaluating the conditions at
each node. For example, we can build a tree like the one in Fig. 9.2, based
on the training data. Then, if a record with YearEstablished = 1995 and
AnnualProfitCont = 1,200,000 arrives, it will be classified as 'G', following
the path: YearEstablished ≤ 1977 = false, AnnualProfitCont ≤ 1,000,000 =
false. Again, the rationale here is that even if the customer has established