resulting partitions and reflects the least randomness or “impurity” in these parti-
tions. Such an approach minimizes the expected number of tests needed to classify
a given tuple and guarantees that a simple (but not necessarily the simplest) tree is
found.
The expected information needed to classify a tuple in D is given by
\[ \text{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i), \qquad (8.1) \]
where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D|. A log function to the base 2 is used, because the information is encoded in bits. Info(D) is just the average amount of information needed to identify the class label of a tuple in D. Note that, at this point, the information we have is based solely on the proportions of tuples of each class. Info(D) is also known as the entropy of D.
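To make Eq. (8.1) concrete, here is a minimal Python sketch that computes Info(D) from a list of class labels; the function name info and the example label counts are illustrative assumptions, not from the text.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D), Eq. (8.1): expected bits needed to identify the class of a tuple in D."""
    n = len(labels)
    # p_i = |C_{i,D}| / |D|; only nonzero probabilities contribute to the sum
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# For example, a set with 9 tuples of one class and 5 of another
# has an entropy of about 0.940 bits.
print(info(["yes"] * 9 + ["no"] * 5))
```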
Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a_1, a_2, ..., a_v}, as observed from the training data. If A is discrete-valued, these values correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets, {D_1, D_2, ..., D_v}, where D_j contains those tuples in D that have outcome a_j of A. These partitions would correspond to the branches grown from node N. Ideally, we would like this partitioning to produce an exact classification of the tuples. That is, we would like for each partition to be pure. However, it is quite likely that the partitions will be impure (e.g., where a partition may contain a collection of tuples from different classes rather than from a single class).
How much more information would we still need (after the partitioning) to arrive at
an exact classification? This amount is measured by
\[ \text{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \text{Info}(D_j). \qquad (8.2) \]
The term |D_j|/|D| acts as the weight of the jth partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A. The smaller the expected information (still) required, the greater the purity of the partitions.
Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,
\[ \text{Gain}(A) = \text{Info}(D) - \text{Info}_A(D). \qquad (8.3) \]
In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N. This is equivalent to saying that we want to partition on the attribute A that would do the “best classification,” so that the amount of information still required to finish classifying the tuples is minimal (i.e., minimum Info_A(D)).
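As a rough sketch of how Eqs. (8.2) and (8.3) drive attribute selection, the Python below partitions a toy set of tuples on each candidate attribute, computes the weighted entropy Info_A(D), and picks the attribute with the highest gain. The attribute names, values, and the tiny dataset are hypothetical, chosen only to illustrate the computation.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D), Eq. (8.1)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(tuples, attr, label="class"):
    """Gain(A) = Info(D) - Info_A(D), Eqs. (8.2) and (8.3)."""
    n = len(tuples)
    partitions = {}                       # D_j for each observed value a_j of A
    for t in tuples:
        partitions.setdefault(t[attr], []).append(t[label])
    # Info_A(D): entropy of each partition D_j, weighted by |D_j| / |D|
    info_a = sum(len(ls) / n * info(ls) for ls in partitions.values())
    return info([t[label] for t in tuples]) - info_a

# Hypothetical training tuples; attribute and class names are illustrative only.
D = [
    {"age": "youth",  "income": "high",   "class": "no"},
    {"age": "youth",  "income": "medium", "class": "no"},
    {"age": "middle", "income": "high",   "class": "yes"},
    {"age": "senior", "income": "medium", "class": "yes"},
    {"age": "senior", "income": "low",    "class": "yes"},
    {"age": "senior", "income": "low",    "class": "no"},
]

# The attribute with the highest information gain becomes the splitting attribute at node N.
best = max(["age", "income"], key=lambda a: gain(D, a))
print(best, {a: round(gain(D, a), 3) for a in ["age", "income"]})
```

In this toy data, splitting on "age" yields purer partitions than splitting on "income", so it has the larger gain and would be chosen at node N.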