“But how can we compute the information gain of an attribute that is continuous-valued, unlike in the example?” Suppose, instead, that we have an attribute A that is continuous-valued, rather than discrete-valued. (For example, suppose that instead of the discretized version of age from the example, we have the raw values for this attribute.) For such a scenario, we must determine the “best” split-point for A, where the split-point is a threshold on A.
We first sort the values of A in increasing order. Typically, the midpoint between each pair of adjacent values is considered as a possible split-point. Therefore, given v values of A, then v − 1 possible splits are evaluated. For example, the midpoint between the values a_i and a_{i+1} of A is

    (a_i + a_{i+1}) / 2.    (8.4)
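The midpoint rule of Eq. (8.4) can be sketched in a few lines of Python (the function name is ours, introduced for illustration):

```python
def candidate_split_points(values):
    """Return the v - 1 midpoints between adjacent sorted values of A."""
    vals = sorted(values)
    return [(vals[i] + vals[i + 1]) / 2 for i in range(len(vals) - 1)]

# e.g., raw (undiscretized) age values
print(candidate_split_points([30, 25, 40]))  # [27.5, 35.0]
```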
If the values of A are sorted in advance, then determining the best split for A requires only one pass through the values. For each possible split-point for A, we evaluate Info_A(D), where the number of partitions is two, that is, v = 2 (or j = 1, 2) in Eq. (8.2). The point with the minimum expected information requirement for A is selected as the split_point for A. D_1 is the set of tuples in D satisfying A ≤ split_point, and D_2 is the set of tuples in D satisfying A > split_point.
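The procedure just described can be sketched as follows. This is a minimal illustration, not the book's implementation: for clarity it recomputes the two partitions at every candidate threshold rather than maintaining running class counts in a true single pass over the sorted values.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Pick the threshold t minimizing Info_A(D) for the binary split
    A <= t versus A > t (i.e., v = 2 in Eq. 8.2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_info = None, float("inf")
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # equal adjacent values yield no usable midpoint
        t = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for x, y in pairs if x <= t]   # D1: A <= split_point
        right = [y for x, y in pairs if x > t]   # D2: A > split_point
        info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if info < best_info:
            best_t, best_info = t, info
    return best_t, best_info
```

For example, with raw ages [25, 30, 35, 40] and class labels ['no', 'no', 'yes', 'yes'], the split-point 32.5 produces two pure partitions, so the expected information requirement there is 0.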
Gain Ratio
The information gain measure is biased toward tests with many outcomes. That is, it prefers to select attributes having a large number of values. For example, consider an attribute that acts as a unique identifier such as product ID. A split on product ID would result in a large number of partitions (as many as there are values), each one containing just one tuple. Because each partition is pure, the information required to classify data set D based on this partitioning would be Info_{product ID}(D) = 0. Therefore, the information gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless for classification.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio, which attempts to overcome this bias. It applies a kind of normalization to information gain using a “split information” value defined analogously with Info(D) as

    SplitInfo_A(D) = − Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|).    (8.5)
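Equation (8.5), and the gain-ratio normalization it feeds (C4.5 divides Gain(A) by SplitInfo_A(D)), can be sketched as follows; the function names are ours, introduced for illustration:

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) of Eq. (8.5): potential information of the split itself."""
    n = sum(partition_sizes)
    return -sum((d / n) * math.log2(d / n) for d in partition_sizes if d > 0)

def gain_ratio(gain, partition_sizes):
    """C4.5's normalization: Gain(A) divided by SplitInfo_A(D)."""
    si = split_info(partition_sizes)
    return gain / si if si > 0 else 0.0
```

A split like product ID that puts each of 14 tuples in its own partition has SplitInfo = log2(14) ≈ 3.81, so even its maximal gain is heavily discounted, whereas a balanced binary split has SplitInfo = 1.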
This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. It differs from information gain, which measures the information with respect to classification that is acquired based on the