The Gini index has been used in various works such as [Breiman et al. (1984)]
and [Gelfand et al. (1991)], and it is defined as:

$$Gini(y, S) = 1 - \sum_{c_j \in dom(y)} \left( \frac{\left| \sigma_{y=c_j} S \right|}{|S|} \right)^2. \qquad (5.5)$$
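As a minimal sketch of Equation (5.5), the Gini index of a label sample can be computed directly from the class proportions (the function name `gini` is chosen for illustration):

```python
from collections import Counter

def gini(labels):
    """Gini index of a label sample: 1 minus the sum of squared
    class proportions, as in Eq. (5.5)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure sample has Gini index 0; a balanced binary sample has 0.5.
print(gini(["a", "a", "a"]))       # -> 0.0
print(gini(["a", "a", "b", "b"]))  # -> 0.5
```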
Consequently, the evaluation criterion for selecting the attribute $a_i$ is
defined as:

$$GiniGain(a_i, S) = Gini(y, S) - \sum_{v_{i,j} \in dom(a_i)} \frac{\left| \sigma_{a_i=v_{i,j}} S \right|}{|S|} \cdot Gini(y, \sigma_{a_i=v_{i,j}} S). \qquad (5.6)$$
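Equation (5.6) can be sketched as follows: the sample is partitioned by the values of the candidate attribute, and the size-weighted Gini of the partitions is subtracted from the Gini of the whole sample (the helper names `gini` and `gini_gain` are illustrative, not from the text):

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini index as in Eq. (5.5)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(values, labels):
    """GiniGain(a_i, S): Gini of the full sample minus the size-weighted
    Gini of each partition induced by the attribute values (Eq. 5.6)."""
    n = len(labels)
    partitions = defaultdict(list)
    for v, y in zip(values, labels):
        partitions[v].append(y)
    weighted = sum(len(part) / n * gini(part) for part in partitions.values())
    return gini(labels) - weighted

# An attribute that separates the classes perfectly recovers the full
# Gini index of the sample as gain.
print(gini_gain([0, 0, 1, 1], ["a", "a", "b", "b"]))  # -> 0.5
```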
5.1.5 Likelihood Ratio Chi-squared Statistics
The likelihood-ratio statistic is defined as [Attneave (1959)]:

$$G^2(a_i, S) = 2 \cdot \ln 2 \cdot |S| \cdot InformationGain(a_i, S). \qquad (5.7)$$
This ratio is useful for measuring the statistical significance of the
information gain criterion. The null hypothesis ($H_0$) is that the
input and target attributes are conditionally independent. If $H_0$ holds,
the test statistic is distributed as $\chi^2$ with
$(|dom(a_i)| - 1) \cdot (|dom(y)| - 1)$ degrees of freedom.
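A minimal sketch of Equation (5.7) and its degrees of freedom, built on a standard information-gain computation (all function names here are illustrative assumptions):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (base 2) of the class distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of S minus the size-weighted entropy of each partition
    induced by the attribute values."""
    n = len(labels)
    parts = defaultdict(list)
    for v, y in zip(values, labels):
        parts[v].append(y)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())

def likelihood_ratio(values, labels):
    """G^2(a_i, S) = 2 * ln(2) * |S| * InformationGain(a_i, S), Eq. (5.7)."""
    return 2.0 * math.log(2) * len(labels) * information_gain(values, labels)

def degrees_of_freedom(values, labels):
    """(|dom(a_i)| - 1) * (|dom(y)| - 1), the df of the chi-squared
    reference distribution under H0."""
    return (len(set(values)) - 1) * (len(set(labels)) - 1)

values = [0, 0, 1, 1]
labels = ["a", "a", "b", "b"]
print(likelihood_ratio(values, labels))   # 2 * ln(2) * 4 * 1 bit of gain
print(degrees_of_freedom(values, labels))  # -> 1
```

The resulting statistic would be compared against the $\chi^2$ quantile for the computed degrees of freedom to judge significance.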
5.1.6 DKM Criterion
The DKM criterion is an impurity-based splitting criterion designed for
binary class attributes [Kearns and Mansour (1999)]. The impurity-based
function is defined as:

$$DKM(y, S) = 2 \cdot \sqrt{\frac{\left| \sigma_{y=c_1} S \right|}{|S|} \cdot \frac{\left| \sigma_{y=c_2} S \right|}{|S|}}. \qquad (5.8)$$
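A minimal sketch of Equation (5.8), assuming a binary target and treating the two class proportions symmetrically (the function name `dkm` is an illustrative choice):

```python
import math

def dkm(labels, positive_class):
    """DKM criterion for a binary target: 2 * sqrt(q * (1 - q)),
    where q is the proportion of the positive class (Eq. 5.8)."""
    q = sum(1 for y in labels if y == positive_class) / len(labels)
    return 2.0 * math.sqrt(q * (1.0 - q))

# Like the Gini index, DKM is 0 on a pure sample and maximal (1.0)
# on a balanced binary sample.
print(dkm(["a", "a", "a"], "a"))       # -> 0.0
print(dkm(["a", "a", "b", "b"], "a"))  # -> 1.0
```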
It has been theoretically proven that this criterion requires smaller trees
to obtain a given error than other impurity-based criteria (information
gain and the Gini index).
5.1.7 Normalized Impurity-based Criteria
The impurity-based criterion described above is biased towards attributes
with larger domain values. Namely, it prefers input attributes with many