The Gini index value computed based on this partitioning is

\[
\begin{aligned}
\mathit{Gini}_{\mathit{income} \in \{\mathit{low},\,\mathit{medium}\}}(D)
  &= \frac{10}{14}\,\mathit{Gini}(D_1) + \frac{4}{14}\,\mathit{Gini}(D_2) \\
  &= \frac{10}{14}\left(1 - \left(\frac{7}{10}\right)^{2} - \left(\frac{3}{10}\right)^{2}\right)
   + \frac{4}{14}\left(1 - \left(\frac{2}{4}\right)^{2} - \left(\frac{2}{4}\right)^{2}\right) \\
  &= 0.443 \\
  &= \mathit{Gini}_{\mathit{income} \in \{\mathit{high}\}}(D).
\end{aligned}
\]
Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the subsets {low, high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}). Therefore, the best binary split for attribute income is on {low, medium} (or {high}) because it minimizes the Gini index. Evaluating age, we obtain {youth, senior} (or {middle aged}) as the best split for age, with a Gini index of 0.357; the attributes student and credit rating are both binary, with Gini index values of 0.367 and 0.429, respectively.

The attribute age and splitting subset {youth, senior} therefore give the minimum Gini index overall, with a reduction in impurity of 0.459 − 0.357 = 0.102. The binary split "age ∈ {youth, senior}?" results in the maximum reduction in impurity of the tuples in D and is returned as the splitting criterion. Node N is labeled with the criterion, two branches are grown from it, and the tuples are partitioned accordingly.
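To make the arithmetic above concrete, here is a minimal Python sketch (our illustration, not code from this chapter) that reproduces these Gini computations. The per-class counts for the income split are those implied by the fractions in the equation above (7 "yes" and 3 "no" tuples in D1, 2 and 2 in D2); the counts for the age split (5/5 and 4/0) are the ones consistent with the 0.357 value.

```python
def gini(counts):
    """Gini impurity of a node with the given per-class tuple counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    """Weighted Gini index of a binary split D -> (D1, D2)."""
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return (n1 / n) * gini(left_counts) + (n2 / n) * gini(right_counts)

# income in {low, medium} vs. {high}: D1 has 7 yes / 3 no, D2 has 2 yes / 2 no
print(round(gini_split([7, 3], [2, 2]), 3))                 # 0.443

# age in {youth, senior} vs. {middle aged}: D1 has 5 yes / 5 no, D2 has 4 yes / 0 no
print(round(gini_split([5, 5], [4, 0]), 3))                 # 0.357

# reduction in impurity for the age split: Gini(D) - Gini_age(D),
# where D holds 9 "yes" and 5 "no" tuples
print(round(gini([9, 5]) - gini_split([5, 5], [4, 0]), 3))  # 0.102
```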
Other Attribute Selection Measures
This section on attribute selection measures was not intended to be exhaustive. We
have shown three measures that are commonly used for building decision trees. These
measures are not without their biases. Information gain, as we saw, is biased toward
multivalued attributes. Although the gain ratio adjusts for this bias, it tends to prefer
unbalanced splits in which one partition is much smaller than the others. The Gini index
is biased toward multivalued attributes and has difficulty when the number of classes is
large. It also tends to favor tests that result in equal-size partitions and purity in both
partitions. Although biased, these measures give reasonably good results in practice.
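As a quick illustration of the first of these biases (our sketch, with made-up toy data), consider an identifier-like attribute that takes a distinct value for every tuple: it produces pure one-tuple partitions and therefore attains the maximum possible information gain, even though it is useless for classifying new tuples.

```python
# Toy demonstration (hypothetical data) of information gain's bias
# toward multivalued attributes.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Entropy(D) minus the expected entropy after splitting on `values`."""
    n = len(labels)
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    expected = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - expected

labels  = ["yes", "yes", "no", "no", "yes", "no"]
ids     = [1, 2, 3, 4, 5, 6]              # unique per tuple, like a product ID
grouped = ["a", "a", "b", "b", "b", "a"]  # binary, imperfect but a real signal

print(info_gain(ids, labels))      # 1.0: maximal gain, yet useless for prediction
print(info_gain(grouped, labels))  # ~0.08: modest gain from a genuine attribute
```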
Many other attribute selection measures have been proposed. CHAID, a decision tree algorithm that is popular in marketing, uses an attribute selection measure that is based on the statistical χ² test for independence. Other measures include C-SEP (which performs better than information gain and the Gini index in certain cases) and the G-statistic (an information-theoretic measure that is a close approximation to the χ² distribution).
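A CHAID-style selection step can be sketched as follows (a minimal illustration; the contingency table of attribute values versus class labels is hypothetical, and scipy's chi2_contingency performs the χ² test for independence):

```python
# Sketch of the chi-square test of independence underlying CHAID-style
# attribute selection; the observed counts are hypothetical.
from scipy.stats import chi2_contingency

observed = [[30, 10],   # attribute value "a": 30 "yes" tuples, 10 "no"
            [15, 25]]   # attribute value "b": 15 "yes" tuples, 25 "no"

chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p)  # a large statistic (small p-value) indicates the attribute
                # and the class are dependent, i.e., the attribute is informative
```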
Attribute selection measures based on the Minimum Description Length (MDL) principle have the least bias toward multivalued attributes. MDL-based measures use encoding techniques to define the "best" decision tree as the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree (i.e., the training tuples that the tree misclassifies).
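In its generic form (standard MDL notation, not taken from this text), the selected tree \(T\) is the one that minimizes the total description length

\[
\mathrm{cost}(T, D) = L(T) + L(D \mid T),
\]

where \(L(T)\) is the number of bits needed to encode the tree itself and \(L(D \mid T)\) is the number of bits needed to encode the exceptions, that is, the tuples in \(D\) that \(T\) misclassifies.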
 