Gini index has been used in various works such as [Breiman et al. (1984)] and [Gelfand et al. (1991)] and it is defined as:

Gini(y, S) = 1 - \sum_{c_j \in dom(y)} \left( \frac{|\sigma_{y=c_j} S|}{|S|} \right)^2 .    (5.5)
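As a minimal sketch of Eq. (5.5) in Python (the function name is illustrative, not from the text), the Gini index can be computed directly from the class frequencies of a label sequence:

```python
from collections import Counter

def gini(labels):
    """Gini index of a label collection S, per Eq. (5.5):
    1 minus the sum of squared class proportions |sigma_{y=c_j} S| / |S|."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())
```

A pure sample yields 0, while a balanced binary sample yields 0.5, the maximum for two classes.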
Consequently, the evaluation criterion for selecting the attribute a_i is defined as:

GiniGain(a_i, S) = Gini(y, S) - \sum_{v_{i,j} \in dom(a_i)} \frac{|\sigma_{a_i=v_{i,j}} S|}{|S|} \cdot Gini(y, \sigma_{a_i=v_{i,j}} S) .    (5.6)
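Eq. (5.6) subtracts the size-weighted Gini of each partition induced by a_i from the Gini of the whole sample. A sketch under the same illustrative naming, partitioning parallel lists of attribute values and labels:

```python
from collections import Counter, defaultdict

def gini(labels):
    """Gini index per Eq. (5.5)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(attr_values, labels):
    """Gini gain per Eq. (5.6): Gini(y, S) minus the weighted
    Gini of each partition sigma_{a_i = v_{i,j}} S."""
    n = len(labels)
    partitions = defaultdict(list)
    for v, y in zip(attr_values, labels):
        partitions[v].append(y)
    weighted = sum(len(part) / n * gini(part) for part in partitions.values())
    return gini(labels) - weighted
```

An attribute that perfectly separates the classes recovers the full Gini index as gain; an uninformative attribute yields a gain of zero.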
5.1.5 Likelihood Ratio Chi-squared Statistics
The likelihood-ratio is defined as [Attneave (1959)]

G^2(a_i, S) = 2 \cdot \ln 2 \cdot |S| \cdot InformationGain(a_i, S) .    (5.7)
This ratio is useful for measuring the statistical significance of the information gain criterion. The null hypothesis (H_0) is that the input and target attributes are conditionally independent. If H_0 holds, the test statistic is distributed as \chi^2 with (|dom(a_i)| - 1) \cdot (|dom(y)| - 1) degrees of freedom.
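Eq. (5.7) rescales the base-2 information gain by 2 ln 2 |S| into a natural-log likelihood-ratio statistic. A sketch (illustrative names; entropy and information gain computed in base 2 as is conventional):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a label collection, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Entropy of S minus the size-weighted entropy of each partition."""
    n = len(labels)
    partitions = defaultdict(list)
    for v, y in zip(attr_values, labels):
        partitions[v].append(y)
    return entropy(labels) - sum(len(p) / n * entropy(p)
                                 for p in partitions.values())

def g2(attr_values, labels):
    """Likelihood-ratio statistic per Eq. (5.7):
    2 * ln(2) * |S| * InformationGain(a_i, S)."""
    return 2.0 * math.log(2) * len(labels) * information_gain(attr_values, labels)
```

The statistic can then be compared against a \chi^2 distribution with (|dom(a_i)| - 1)(|dom(y)| - 1) degrees of freedom to judge significance.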
5.1.6 DKM Criterion
The DKM criterion is an impurity-based splitting criterion designed for binary class attributes [Kearns and Mansour (1999)]. The impurity-based function is defined as:

DKM(y, S) = 2 \cdot \sqrt{\frac{|\sigma_{y=c_1} S|}{|S|} \cdot \frac{|\sigma_{y=c_2} S|}{|S|}} .    (5.8)
It has been theoretically proven that this criterion requires smaller trees to obtain a given error than other impurity-based criteria (information gain and Gini index).
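Since DKM is defined only for binary classes, Eq. (5.8) reduces to 2\sqrt{p(1-p)} where p is the proportion of one class. A sketch (the function name and the explicit class argument are illustrative):

```python
import math

def dkm(labels, class_one):
    """DKM impurity per Eq. (5.8): 2 * sqrt(p * (1 - p)),
    where p = |sigma_{y=c_1} S| / |S| for a binary class attribute."""
    n = len(labels)
    p = sum(1 for y in labels if y == class_one) / n
    return 2.0 * math.sqrt(p * (1.0 - p))
```

Like the Gini index and entropy, DKM is 0 for a pure sample and maximal (here 1.0) for a perfectly balanced one.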
5.1.7 Normalized Impurity-based Criteria
The impurity-based criterion described above is biased towards attributes
with larger domain values. Namely, it prefers input attributes with many