values over those with fewer values [Quinlan (1986)]. For instance, an input
attribute that represents a national identification number will probably obtain
the highest information gain. However, adding this attribute to a decision
tree will result in poor generalization accuracy. For that reason, it is useful
to “normalize” the impurity-based measures, as described in the following
sections.
5.1.8 Gain Ratio
The gain ratio normalizes the information gain as follows [Quinlan (1993)]:
$$\text{Gain Ratio}(a_i, S) = \frac{\text{Information Gain}(a_i, S)}{\text{Entropy}(a_i, S)} \qquad (5.9)$$
Note that this ratio is not defined when the denominator is zero.
Furthermore, the ratio may tend to favor attributes for which the denomi-
nator is very small. Accordingly, it is suggested that the selection be carried
out in two stages: first, the information gain is calculated for all attributes;
then, considering only attributes whose information gain is at least as high
as the average, the attribute with the best gain ratio is selected. Quinlan
(1988) has shown that the gain ratio tends to outperform the simple
information gain criterion, both in accuracy and in terms of classifier
complexity. In addition, a penalty is assessed against the information gain
of a continuous attribute with many potential split points.
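To make the two-stage procedure concrete, the following is a minimal Python sketch of the gain ratio and the two-stage attribute selection. The helper names (`entropy`, `information_gain`, `gain_ratio`, `select_attribute`) are illustrative, not taken from the text, and `attributes` is assumed to be a mapping from attribute name to that attribute's value for each instance:

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def information_gain(values, labels):
    """Entropy reduction obtained by partitioning `labels` on `values`."""
    n = len(labels)
    split_entropy = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        split_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - split_entropy

def gain_ratio(values, labels):
    """Information gain normalized by the attribute's own entropy (Eq. 5.9).
    The ratio is undefined when that entropy is zero; we return 0.0 here,
    since such an attribute cannot split the sample anyway."""
    denom = entropy(values)
    if denom == 0.0:
        return 0.0
    return information_gain(values, labels) / denom

def select_attribute(attributes, labels):
    """Two-stage selection: keep only attributes whose information gain is
    at least the average, then pick the best gain ratio among them."""
    gains = {a: information_gain(v, labels) for a, v in attributes.items()}
    avg = sum(gains.values()) / len(gains)
    candidates = [a for a, g in gains.items() if g >= avg]
    return max(candidates, key=lambda a: gain_ratio(attributes[a], labels))
```

The first stage filters out attributes with below-average information gain, which prevents an attribute with a tiny denominator (and hence an inflated ratio) from being selected despite carrying little information.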
5.1.9 Distance Measure
The Distance Measure, like the Gain Ratio, normalizes the impurity
measure. However, the normalization is performed differently [Lopez de Mantaras (1991)]:
$$\frac{\Delta\Phi(a_i, S)}{-\sum\limits_{v_{i,j} \in \mathrm{dom}(a_i)} \sum\limits_{c_k \in \mathrm{dom}(y)} \frac{\left|\sigma_{a_i = v_{i,j}\ \mathrm{AND}\ y = c_k} S\right|}{|S|} \cdot \log_2 \frac{\left|\sigma_{a_i = v_{i,j}\ \mathrm{AND}\ y = c_k} S\right|}{|S|}} \qquad (5.10)$$
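The denominator of Eq. (5.10) is the joint entropy of the attribute and the class label. Reusing the helpers from the previous sketch, and assuming the entropy-based impurity so that ΔΦ equals the information gain, a minimal sketch (the function names are again illustrative):

```python
def joint_entropy(values, labels):
    """Joint entropy of the (attribute value, class) pairs, i.e. the
    denominator of Eq. (5.10)."""
    return entropy(list(zip(values, labels)))

def distance_measure(values, labels):
    """Normalize the impurity reduction by the joint entropy of attribute
    and class.  Here the impurity reduction is the information gain,
    assuming the entropy-based impurity measure."""
    denom = joint_entropy(values, labels)
    return information_gain(values, labels) / denom if denom else 0.0
```

Dividing by the joint entropy rather than the attribute entropy penalizes many-valued attributes in a related but distinct way, since the denominator grows with the number of occupied (value, class) cells.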
5.1.10 Binary Criteria
The binary criteria are used for creating binary decision trees. These
measures are based on the division of the input attribute domain into two
subdomains.
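Since a discrete domain with $k$ values admits $2^{k-1} - 1$ distinct partitions into two non-empty subdomains, enumerating the candidate splits is straightforward. The sketch below (with the illustrative name `binary_partitions`, not from the text) shows one way to generate them; a binary criterion would then score each candidate partition:

```python
from itertools import combinations

def binary_partitions(domain):
    """Yield every split of a discrete domain into two non-empty
    subdomains, generating each unordered pair exactly once."""
    domain = sorted(domain)
    first, rest = domain[0], domain[1:]   # fix one value on the left side
    for r in range(len(rest) + 1):        # to avoid mirrored duplicates
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(domain) - left
            if right:                     # both subdomains must be non-empty
                yield left, right

# e.g. list(binary_partitions(["red", "green", "blue"])) yields the
# 2^(3-1) - 1 = 3 candidate binary splits of a three-valued attribute.
```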