[Figure] Fig. 9.2 An example of a decision tree model for the Northwind customers. The tree splits first on YearEstablished (against 1977) and then on AnnualProfit (against 500,000 and 1,000,000), with Y/N branches leading to leaves that classify a customer as good (G) or bad (B).
As the figure shows, even if a customer was established relatively recently, its profit may be high enough for it to be considered a reliable customer.
There are many algorithms for computing a decision tree. A well-known one is the ID3 algorithm. Here, starting at the root, which initially contains all the training samples, the attribute conveying the most information is used to split the nodes in a recursive fashion. Other algorithms, like SPRINT and SLIQ, are used to improve the efficiency of classification in large databases.
The basic process involves two steps, partitioning and pruning, as shown
below. It is usual to split the data set into two halves, the first being used
for partitioning and the second for pruning:
ID3 Algorithm
INPUT: A data set T
OUTPUT: A classification tree
BEGIN
1. Build an initial tree from the training data set T:
   IF all points in T belong to the same class THEN RETURN;
   Evaluate splits for every attribute;
   Use the best split to partition T into T1 and T2;
   ID3(T1);
   ID3(T2);
2. Prune this tree to increase test accuracy;
   this removes branches that are likely to induce errors.
END
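To make the recursion above concrete, here is a minimal runnable sketch in Python of the partitioning step (step 1 only; pruning is not shown). The helper names entropy, best_split, and id3, as well as the toy data, are illustrative assumptions, not code from any ID3 implementation:

from collections import Counter
import math

def entropy(labels):
    # Information entropy of a multiset of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_split(rows, labels):
    # Evaluate splits for every attribute; return the (attribute, value)
    # pair with the highest information gain, or None if no split helps
    base, best, best_gain = entropy(labels), None, 0.0
    for attr in range(len(rows[0])):
        for value in {row[attr] for row in rows}:
            left = [l for r, l in zip(rows, labels) if r[attr] == value]
            right = [l for r, l in zip(rows, labels) if r[attr] != value]
            if not left or not right:
                continue
            gain = base - (len(left) * entropy(left)
                           + len(right) * entropy(right)) / len(labels)
            if gain > best_gain:
                best, best_gain = (attr, value), gain
    return best

def id3(rows, labels):
    # If all points belong to the same class, return a leaf
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    attr, value = split
    # Use the best split to partition T into T1 and T2, then recurse
    T1 = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
    T2 = [(r, l) for r, l in zip(rows, labels) if r[attr] != value]
    return {split: (id3(*zip(*T1)), id3(*zip(*T2)))}

On four toy customers described by (establishment period, profit level), the sketch recovers a tree that splits on profit:

rows = [("<1977", "high"), ("<1977", "low"), (">=1977", "high"), (">=1977", "low")]
labels = ["G", "B", "G", "B"]
print(id3(rows, labels))   # {(1, 'high'): ('G', 'B')}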
A key challenge in these algorithms is how to partition the tree nodes. This is done by measuring the information conveyed by each attribute: attributes conveying more information are selected first. There are many methods for node partitioning. For example, the SPRINT and SLIQ algorithms use the Gini index, while ID3 uses the notion of information entropy.
The Gini index for a data set T whose elements are classified into C classes is defined as follows:

Gini(T) = 1 - \sum_{i=1}^{C} p_i^2

where p_i is the relative frequency of class i in T.
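As a small illustration (a sketch with hypothetical helper names, not code from SPRINT or SLIQ), both impurity measures can be computed directly from the class frequencies:

from collections import Counter
import math

def gini(labels):
    # Gini(T) = 1 - sum over classes of p_i squared
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def info_entropy(labels):
    # Entropy(T) = - sum over classes of p_i * log2(p_i)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

# Two good and two bad customers: both measures are at their maximum,
# while a pure set such as ["G", "G"] yields 0 under either measure
print(gini(["G", "G", "B", "B"]))          # 0.5
print(info_entropy(["G", "G", "B", "B"]))  # 1.0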