Gini index has been used in various works such as [Breiman et al. (1984)] and [Gelfand et al. (1991)] and it is defined as:

Gini(y, S) = 1 - \sum_{c_j \in dom(y)} \left( \frac{|\sigma_{y=c_j} S|}{|S|} \right)^2 .    (5.5)
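As a minimal sketch of Eq. (5.5) in Python (the function name is illustrative, not from the text), the Gini index can be computed directly from the class frequencies of a label sequence:

```python
from collections import Counter

def gini(labels):
    """Gini index of a label collection S, per Eq. (5.5):
    1 minus the sum of squared class proportions |sigma_{y=c_j} S| / |S|."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())
```

A pure sample yields 0, while a balanced binary sample yields 0.5, the maximum for two classes.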
Consequently, the evaluation criterion for selecting the attribute a_i is defined as:

GiniGain(a_i, S) = Gini(y, S) - \sum_{v_{i,j} \in dom(a_i)} \frac{|\sigma_{a_i=v_{i,j}} S|}{|S|} \cdot Gini(y, \sigma_{a_i=v_{i,j}} S) .    (5.6)
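Eq. (5.6) subtracts the size-weighted Gini of each partition induced by a_i from the Gini of the whole sample. A sketch under the same illustrative naming, partitioning parallel lists of attribute values and labels:

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini index per Eq. (5.5)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(attr_values, labels):
    """Gini gain per Eq. (5.6): Gini(y, S) minus the weighted
    Gini of each partition sigma_{a_i = v_{i,j}} S."""
    n = len(labels)
    partitions = defaultdict(list)
    for v, y in zip(attr_values, labels):
        partitions[v].append(y)
    weighted = sum(len(part) / n * gini(part) for part in partitions.values())
    return gini(labels) - weighted
```

An attribute that perfectly separates the classes recovers the full Gini index as gain; an uninformative attribute yields a gain of zero.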
5.1.5 Likelihood Ratio Chi-squared Statistics
The likelihood-ratio is defined as [Attneave (1959)]

G^2(a_i, S) = 2 \cdot \ln 2 \cdot |S| \cdot InformationGain(a_i, S) .    (5.7)
This ratio is useful for measuring the statistical significance of the information gain criterion. The null hypothesis (H_0) is that the input and target attributes are conditionally independent. If H_0 holds, the test statistic is distributed as \chi^2 with (|dom(a_i)| - 1) \cdot (|dom(y)| - 1) degrees of freedom.
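Eq. (5.7) rescales the base-2 information gain by 2 ln 2 |S| into a natural-log likelihood-ratio statistic. A sketch (illustrative names; entropy and information gain computed in base 2 as is conventional):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a label collection, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Entropy of S minus the size-weighted entropy of each partition."""
    n = len(labels)
    partitions = defaultdict(list)
    for v, y in zip(attr_values, labels):
        partitions[v].append(y)
    return entropy(labels) - sum(len(p) / n * entropy(p)
                                 for p in partitions.values())

def g2(attr_values, labels):
    """Likelihood-ratio statistic per Eq. (5.7):
    2 * ln(2) * |S| * InformationGain(a_i, S)."""
    return 2.0 * math.log(2) * len(labels) * information_gain(attr_values, labels)
```

The statistic can then be compared against a \chi^2 distribution with (|dom(a_i)| - 1)(|dom(y)| - 1) degrees of freedom to judge significance.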
5.1.6 DKM Criterion
The DKM criterion is an impurity-based splitting criterion designed for binary class attributes [Kearns and Mansour (1999)]. The impurity-based function is defined as:

DKM(y, S) = 2 \cdot \sqrt{\frac{|\sigma_{y=c_1} S|}{|S|} \cdot \frac{|\sigma_{y=c_2} S|}{|S|}} .    (5.8)
It has been theoretically proven that this criterion requires smaller trees to obtain a given error than other impurity-based criteria (information gain and Gini index).
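Since DKM is defined only for binary classes, Eq. (5.8) reduces to 2\sqrt{p(1-p)} where p is the proportion of one class. A sketch (the function name and the explicit class argument are illustrative):

```python
import math

def dkm(labels, class_one):
    """DKM impurity per Eq. (5.8): 2 * sqrt(p * (1 - p)),
    where p = |sigma_{y=c_1} S| / |S| for a binary class attribute."""
    n = len(labels)
    p = sum(1 for y in labels if y == class_one) / n
    return 2.0 * math.sqrt(p * (1.0 - p))
```

Like the Gini index and entropy, DKM is 0 for a pure sample and maximal (here 1.0) for a perfectly balanced one.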
5.1.7 Normalized Impurity-based Criteria
The impurity-based criterion described above is biased towards attributes
with larger domain values. Namely, it prefers input attributes with many