“But how can we compute the information gain of an attribute that is continuous-
valued, unlike in the example?” Suppose, instead, that we have an attribute A that is
continuous-valued, rather than discrete-valued. (For example, suppose that instead
of the discretized version of age from the example, we have the raw values for this
attribute.) For such a scenario, we must determine the “best” split-point for A, where the split-point is a threshold on A.
We first sort the values of A in increasing order. Typically, the midpoint between each pair of adjacent values is considered as a possible split-point. Therefore, given v values of A, $v - 1$ possible splits are evaluated. For example, the midpoint between the values $a_i$ and $a_{i+1}$ of A is

$$\frac{a_i + a_{i+1}}{2}. \tag{8.4}$$
If the values of A are sorted in advance, then determining the best split for A requires only one pass through the values. For each possible split-point for A, we evaluate $Info_A(D)$, where the number of partitions is two, that is, $v = 2$ (or $j = 1, 2$) in Eq. (8.2). The point with the minimum expected information requirement for A is selected as the split_point for A. $D_1$ is the set of tuples in D satisfying $A \le split\_point$, and $D_2$ is the set of tuples in D satisfying $A > split\_point$.
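To make this concrete, here is a minimal Python sketch of the split-point search. The function names (entropy, best_split_point) and the list-based interface are our own, and for clarity the sketch re-partitions the tuples at each candidate midpoint rather than maintaining running class counts in the single sorted pass described above.

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information (entropy) needed to classify a tuple: Info(D)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return (split_point, Info_A(D)) minimizing the expected information
    requirement over the midpoints of adjacent sorted values, Eq. (8.4)."""
    pairs = sorted(zip(values, labels))              # sort the values of A
    n = len(pairs)
    best_split, best_info = None, float("inf")
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                 # equal adjacent values: no midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2  # midpoint, Eq. (8.4)
        d1 = [lbl for v, lbl in pairs if v <= split] # D1: tuples with A <= split_point
        d2 = [lbl for v, lbl in pairs if v > split]  # D2: tuples with A > split_point
        info_a = (len(d1) / n) * entropy(d1) + (len(d2) / n) * entropy(d2)
        if info_a < best_info:
            best_split, best_info = split, info_a
    return best_split, best_info

# Hypothetical example: raw (undiscretized) ages with class labels
ages = [23, 25, 31, 35, 40, 46]
buys = ["no", "no", "yes", "yes", "yes", "no"]
print(best_split_point(ages, buys))  # -> (28.0, ~0.541): best threshold on age
```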
Gain Ratio
The information gain measure is biased toward tests with many outcomes. That is, it
prefers to select attributes having a large number of values. For example, consider an
attribute that acts as a unique identifier such as product ID . A split on product ID would
result in a large number of partitions (as many as there are values), each one containing
just one tuple. Because each partition is pure, the information required to classify data
set D based on this partitioning would be $Info_{product\_ID}(D) = 0$. Therefore, the information gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless for classification.
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio ,
which attempts to overcome this bias. It applies a kind of normalization to information
gain using a “split information” value defined analogously with $Info(D)$ as

$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\!\left(\frac{|D_j|}{|D|}\right). \tag{8.5}$$
This value represents the potential information generated by splitting the training
data set, D , into v partitions, corresponding to the v outcomes of a test on attribute A .
Note that, for each outcome, it considers the number of tuples having that outcome
with respect to the total number of tuples in D. It differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning.
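A short sketch of Eq. (8.5) may clarify how this value counters the bias toward many-valued attributes; the function name split_info is our own.

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) of Eq. (8.5): the potential information generated by
    splitting D into v partitions with the given tuple counts."""
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s > 0)

# Splitting 8 tuples into two balanced partitions versus splitting on a
# unique identifier such as product_ID (eight partitions of one tuple each):
print(split_info([4, 4]))   # 1.0
print(split_info([1] * 8))  # 3.0 -- high SplitInfo penalizes the identifier split
```

C4.5 divides the information gain by this value to obtain the gain ratio, so the large SplitInfo of an identifier-like split deflates its otherwise maximal gain.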