The Gini index value computed based on this partitioning is

\[
\begin{aligned}
\mathit{Gini}_{\mathit{income} \in \{\mathit{low},\,\mathit{medium}\}}(D)
  &= \frac{10}{14}\,\mathit{Gini}(D_1) + \frac{4}{14}\,\mathit{Gini}(D_2) \\
  &= \frac{10}{14}\left(1 - \left(\frac{7}{10}\right)^{2} - \left(\frac{3}{10}\right)^{2}\right)
   + \frac{4}{14}\left(1 - \left(\frac{2}{4}\right)^{2} - \left(\frac{2}{4}\right)^{2}\right) \\
  &= 0.443 \\
  &= \mathit{Gini}_{\mathit{income} \in \{\mathit{high}\}}(D).
\end{aligned}
\]
Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}). Therefore, the best binary split for attribute income is on {low, medium} (or {high}) because it minimizes the Gini index. Evaluating age, we obtain {youth, senior} (or {middle aged}) as the best split for age, with a Gini index of 0.357; the attributes student and credit rating are both binary, with Gini index values of 0.367 and 0.429, respectively.

The attribute age and splitting subset {youth, senior} therefore give the minimum Gini index overall, with a reduction in impurity of 0.459 − 0.357 = 0.102. The binary split "age ∈ {youth, senior}?" results in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion. Node N is labeled with the criterion, two branches are grown from it, and the tuples are partitioned accordingly.
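To make the arithmetic above concrete, here is a minimal Python sketch (our illustration, not code from this chapter) that reproduces these Gini computations. The per-class counts for the income split are those implied by the fractions in the equation above (7 "yes" and 3 "no" tuples in D1, 2 and 2 in D2); the counts for the age split (5/5 and 4/0) are the ones consistent with the 0.357 value.

```python
def gini(counts):
    """Gini impurity of a node with the given per-class tuple counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    """Weighted Gini index of a binary split D -> (D1, D2)."""
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return (n1 / n) * gini(left_counts) + (n2 / n) * gini(right_counts)

# income in {low, medium} vs. {high}: D1 has 7 yes / 3 no, D2 has 2 yes / 2 no
print(round(gini_split([7, 3], [2, 2]), 3))                 # 0.443

# age in {youth, senior} vs. {middle aged}: D1 has 5 yes / 5 no, D2 has 4 yes / 0 no
print(round(gini_split([5, 5], [4, 0]), 3))                 # 0.357

# reduction in impurity for the age split: Gini(D) - Gini_age(D),
# where D holds 9 "yes" and 5 "no" tuples
print(round(gini([9, 5]) - gini_split([5, 5], [4, 0]), 3))  # 0.102
```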
Other Attribute Selection Measures
This section on attribute selection measures was not intended to be exhaustive. We
have shown three measures that are commonly used for building decision trees. These
measures are not without their biases. Information gain, as we saw, is biased toward
multivalued attributes. Although the gain ratio adjusts for this bias, it tends to prefer
unbalanced splits in which one partition is much smaller than the others. The Gini index
is biased toward multivalued attributes and has difficulty when the number of classes is
large. It also tends to favor tests that result in equal-size partitions and purity in both
partitions. Although biased, these measures give reasonably good results in practice.
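As a quick illustration of the first of these biases (our sketch, with made-up toy data), consider an identifier-like attribute that takes a distinct value for every tuple: it produces pure one-tuple partitions and therefore attains the maximum possible information gain, even though it is useless for classifying new tuples.

```python
# Toy demonstration (hypothetical data) of information gain's bias
# toward multivalued attributes.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Entropy(D) minus the expected entropy after splitting on `values`."""
    n = len(labels)
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    expected = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - expected

labels  = ["yes", "yes", "no", "no", "yes", "no"]
ids     = [1, 2, 3, 4, 5, 6]              # unique per tuple, like a product ID
grouped = ["a", "a", "b", "b", "b", "a"]  # binary, imperfect but a real signal

print(info_gain(ids, labels))      # 1.0: maximal gain, yet useless for prediction
print(info_gain(grouped, labels))  # ~0.08: modest gain from a genuine attribute
```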
Many other attribute selection measures have been proposed. CHAID, a decision tree algorithm that is popular in marketing, uses an attribute selection measure that is based on the statistical χ² test for independence. Other measures include C-SEP (which performs better than information gain and the Gini index in certain cases) and the G-statistic (an information-theoretic measure that is a close approximation to the χ² distribution).
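A CHAID-style selection step can be sketched as follows (a minimal illustration; the contingency table of attribute values versus class labels is hypothetical, and scipy's chi2_contingency performs the χ² test for independence):

```python
# Sketch of the chi-square test of independence underlying CHAID-style
# attribute selection; the observed counts are hypothetical.
from scipy.stats import chi2_contingency

observed = [[30, 10],   # attribute value "a": 30 "yes" tuples, 10 "no"
            [15, 25]]   # attribute value "b": 15 "yes" tuples, 25 "no"

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)  # a large statistic (small p-value) indicates the attribute
                # and the class are dependent, i.e., the attribute is informative
```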
Attribute selection measures based on the Minimum Description Length (MDL) principle have the least bias toward multivalued attributes. MDL-based measures use encoding techniques to define the "best" decision tree as the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree (i.e., the training tuples that the tree misclassifies).
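In its generic form (standard MDL notation, not taken from this text), the selected tree \(T\) is the one that minimizes the total description length

\[
\mathrm{cost}(T, D) = L(T) + L(D \mid T),
\]

where \(L(T)\) is the number of bits needed to encode the tree itself and \(L(D \mid T)\) is the number of bits needed to encode the exceptions, that is, the tuples in \(D\) that \(T\) misclassifies.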
 