resulting partitions and reflects the least randomness or “impurity” in these parti-
tions. Such an approach minimizes the expected number of tests needed to classify
a given tuple and guarantees that a simple (but not necessarily the simplest) tree is
found.
The expected information needed to classify a tuple in D is given by
\[ \text{Info}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i), \qquad (8.1) \]
where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i and is estimated by |C_{i,D}|/|D|. A log function to the base 2 is used, because the information is encoded in bits. Info(D) is just the average amount of information needed to identify the class label of a tuple in D. Note that, at this point, the information we have is based solely on the proportions of tuples of each class. Info(D) is also known as the entropy of D.
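To make Eq. (8.1) concrete, here is a minimal Python sketch that computes Info(D) from a list of class labels; the function name info and the example label counts are illustrative assumptions, not from the text.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D), Eq. (8.1): expected bits needed to identify the class of a tuple in D."""
    n = len(labels)
    # p_i = |C_{i,D}| / |D|; only nonzero probabilities contribute to the sum
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# For example, a set with 9 tuples of one class and 5 of another
# has an entropy of about 0.940 bits.
print(info(["yes"] * 9 + ["no"] * 5))
```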
Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a_1, a_2, ..., a_v}, as observed from the training data. If A is discrete-valued, these values correspond directly to the v outcomes of a test on A. Attribute A can be used to split D into v partitions or subsets, {D_1, D_2, ..., D_v}, where D_j contains those tuples in D that have outcome a_j of A. These partitions would correspond to the branches grown from node N. Ideally, we would like this partitioning to produce an exact classification of the tuples. That is, we would like for each partition to be pure. However, it is quite likely that the partitions will be impure (e.g., where a partition may contain a collection of tuples from different classes rather than from a single class).
How much more information would we still need (after the partitioning) to arrive at
an exact classification? This amount is measured by
\[ \text{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \text{Info}(D_j). \qquad (8.2) \]
The term |D_j|/|D| acts as the weight of the jth partition. Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A. The smaller the expected information (still) required, the greater the purity of the partitions.
Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,
\[ \text{Gain}(A) = \text{Info}(D) - \text{Info}_A(D). \qquad (8.3) \]
In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in the information requirement caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N. This is equivalent to saying that we want to partition on the attribute A that would do the “best classification,” so that the amount of information still required to finish classifying the tuples is minimal (i.e., minimum Info_A(D)).
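As a rough sketch of how Eqs. (8.2) and (8.3) drive attribute selection, the Python below partitions a toy set of tuples on each candidate attribute, computes the weighted entropy Info_A(D), and picks the attribute with the highest gain. The attribute names, values, and the tiny dataset are hypothetical, chosen only to illustrate the computation.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D), Eq. (8.1)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(tuples, attr, label="class"):
    """Gain(A) = Info(D) - Info_A(D), Eqs. (8.2) and (8.3)."""
    n = len(tuples)
    partitions = {}                       # D_j for each observed value a_j of A
    for t in tuples:
        partitions.setdefault(t[attr], []).append(t[label])
    # Info_A(D): entropy of each partition D_j, weighted by |D_j| / |D|
    info_a = sum(len(ls) / n * info(ls) for ls in partitions.values())
    return info([t[label] for t in tuples]) - info_a

# Hypothetical training tuples; attribute and class names are illustrative only.
D = [
    {"age": "youth",  "income": "high",   "class": "no"},
    {"age": "youth",  "income": "medium", "class": "no"},
    {"age": "middle", "income": "high",   "class": "yes"},
    {"age": "senior", "income": "medium", "class": "yes"},
    {"age": "senior", "income": "low",    "class": "yes"},
    {"age": "senior", "income": "low",    "class": "no"},
]

# The attribute with the highest information gain becomes the splitting attribute at node N.
best = max(["age", "income"], key=lambda a: gain(D, a))
print(best, {a: round(gain(D, a), 3) for a in ["age", "income"]})
```

In this toy data, splitting on "age" yields purer partitions than splitting on "income", so it has the larger gain and would be chosen at node N.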