values over those with fewer values [Quinlan (1986)]. For instance, an input
attribute that represents a national identification number will probably obtain
the highest information gain. However, adding this attribute to a decision
tree will result in poor generalization accuracy. For that reason, it is useful
to “normalize” the impurity-based measures, as described in the following
sections.
5.1.8 Gain Ratio
The gain ratio normalizes the information gain as follows [Quinlan (1993)]:
$$\text{Gain Ratio}(a_i, S) = \frac{\text{Information Gain}(a_i, S)}{\text{Entropy}(a_i, S)} \qquad (5.9)$$
Note that this ratio is not defined when the denominator is zero.
Furthermore, the ratio may tend to favor attributes for which the denomi-
nator is very small. Accordingly, it is suggested that the selection be carried
out in two stages: first, the information gain is calculated for all attributes;
then, considering only attributes whose information gain is at least as high
as the average, the attribute with the best gain ratio is selected. Quinlan
(1988) has shown that the gain ratio tends to outperform the simple
information gain criterion, both in accuracy and in terms of classifier
complexity. In addition, a penalty is assessed against the information gain
of a continuous attribute with many potential split points.
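To make the two-stage procedure concrete, the following is a minimal Python sketch of the gain ratio and the two-stage attribute selection. The helper names (`entropy`, `information_gain`, `gain_ratio`, `select_attribute`) are illustrative, not taken from the text, and `attributes` is assumed to be a mapping from attribute name to that attribute's value for each instance:

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def information_gain(values, labels):
    """Entropy reduction obtained by partitioning `labels` on `values`."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy

def gain_ratio(values, labels):
    """Information gain normalized by the attribute's own entropy (Eq. 5.9).
    The ratio is undefined when that entropy is zero; we return 0.0 here,
    since such an attribute cannot split the sample anyway."""
    denom = entropy(values)
    if denom == 0.0:
        return 0.0
    return information_gain(values, labels) / denom

def select_attribute(attributes, labels):
    """Two-stage selection: keep only attributes whose information gain is
    at least the average, then pick the best gain ratio among them."""
    gains = {a: information_gain(v, labels) for a, v in attributes.items()}
    avg = sum(gains.values()) / len(gains)
    candidates = [a for a, g in gains.items() if g >= avg]
    return max(candidates, key=lambda a: gain_ratio(attributes[a], labels))
```

The first stage filters out attributes with below-average information gain, which prevents an attribute with a tiny denominator (and hence an inflated ratio) from being selected despite carrying little information.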
5.1.9 Distance Measure
The Distance Measure, like the Gain Ratio, normalizes the impurity
measure. However, the normalization is performed differently [Lopez de Mantaras (1991)]:
$$\frac{\Delta\Phi(a_i, S)}{-\sum\limits_{v_{i,j} \in \mathrm{dom}(a_i)} \sum\limits_{c_k \in \mathrm{dom}(y)} \frac{\left|\sigma_{a_i = v_{i,j}\ \mathrm{AND}\ y = c_k} S\right|}{|S|} \cdot \log_2 \frac{\left|\sigma_{a_i = v_{i,j}\ \mathrm{AND}\ y = c_k} S\right|}{|S|}} \qquad (5.10)$$
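The denominator of Eq. (5.10) is the joint entropy of the attribute and the class label. Reusing the helpers from the previous sketch, and assuming the entropy-based impurity so that ΔΦ equals the information gain, a minimal sketch (the function names are again illustrative):

```python
def joint_entropy(values, labels):
    """Joint entropy of the (attribute value, class) pairs, i.e. the
    denominator of Eq. (5.10)."""
    return entropy(list(zip(values, labels)))

def distance_measure(values, labels):
    """Normalize the impurity reduction by the joint entropy of attribute
    and class.  Here the impurity reduction is the information gain,
    assuming the entropy-based impurity measure."""
    denom = joint_entropy(values, labels)
    return information_gain(values, labels) / denom if denom else 0.0
```

Dividing by the joint entropy rather than the attribute entropy penalizes many-valued attributes in a related but distinct way, since the denominator grows with the number of occupied (value, class) cells.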
5.1.10 Binary Criteria
The binary criteria are used for creating binary decision trees. These
measures are based on the division of the input attribute domain into two
subdomains.
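Since a discrete domain with $k$ values admits $2^{k-1} - 1$ distinct partitions into two non-empty subdomains, enumerating the candidate splits is straightforward. The sketch below (with the illustrative name `binary_partitions`, not from the text) shows one way to generate them; a binary criterion would then score each candidate partition:

```python
from itertools import combinations

def binary_partitions(domain):
    """Yield every split of a discrete domain into two non-empty
    subdomains, generating each unordered pair exactly once."""
    domain = sorted(domain)
    first, rest = domain[0], domain[1:]   # fix one value on the left side
    for r in range(len(rest) + 1):        # to avoid mirrored duplicates
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(domain) - left
            if right:                     # both subdomains must be non-empty
                yield left, right

# e.g. list(binary_partitions(["red", "green", "blue"])) yields the
# 2^(3-1) - 1 = 3 candidate binary splits of a three-valued attribute.
```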