It should be noted that if the probability vector has a component of 1 (the variable x takes only one value), then the variable is defined as pure. On the other hand, if all components are equal, the level of impurity reaches its maximum.
Given a training set S, the probability vector of the target attribute y is defined as:

$$P_y(S) = \left( \frac{|\sigma_{y=c_1} S|}{|S|}, \ldots, \frac{|\sigma_{y=c_{|dom(y)|}} S|}{|S|} \right). \qquad (5.1)$$
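Equation (5.1) can be computed directly from a sample of target values. The sketch below, in Python, uses an illustrative function name and a dict result (neither is from the text); it also demonstrates the purity remark above.

```python
from collections import Counter

def probability_vector(y_values):
    """Probability vector P_y(S) of the target attribute over a sample S.

    The component for class c is |sigma_{y=c} S| / |S|, i.e. the fraction
    of instances whose target value equals c.
    """
    counts = Counter(y_values)
    n = len(y_values)
    # Iterate over the observed dom(y) in sorted order for determinism.
    return {c: counts[c] / n for c in sorted(counts)}

# A "pure" variable: one component equals 1.
print(probability_vector(["yes", "yes", "yes"]))        # {'yes': 1.0}
# Maximal impurity: all components equal.
print(probability_vector(["yes", "no", "yes", "no"]))   # {'no': 0.5, 'yes': 0.5}
```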
The goodness-of-split due to a discrete attribute a_i is defined as the reduction in impurity of the target attribute after partitioning S according to the values v_{i,j} ∈ dom(a_i):

$$\Delta\Phi(a_i, S) = \phi(P_y(S)) - \sum_{j=1}^{|dom(a_i)|} \frac{|\sigma_{a_i = v_{i,j}} S|}{|S|} \cdot \phi\left(P_y(\sigma_{a_i = v_{i,j}} S)\right). \qquad (5.2)$$
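The goodness-of-split of Eq. (5.2) can be sketched generically, with the impurity measure φ passed in as a function. The list-of-dicts record representation and all names here are illustrative assumptions, not from the text.

```python
from collections import Counter

def goodness_of_split(records, attr, target, phi):
    """Delta-Phi(a_i, S) from Eq. (5.2): the reduction in impurity of the
    target attribute obtained by partitioning S on a discrete attribute.

    `records` is a list of dicts mapping attribute names to values;
    `phi` maps a probability vector (a list of class fractions) to an
    impurity value.
    """
    def pvec(rows):  # probability vector P_y of a subset, as in Eq. (5.1)
        counts = Counter(r[target] for r in rows)
        return [c / len(rows) for c in counts.values()]

    total = phi(pvec(records))
    weighted = 0.0
    for v in {r[attr] for r in records}:               # v_{i,j} in dom(a_i)
        subset = [r for r in records if r[attr] == v]  # sigma_{a_i = v_{i,j}} S
        weighted += len(subset) / len(records) * phi(pvec(subset))
    return total - weighted
```

For example, with the misclassification impurity `phi = lambda p: 1 - max(p)`, an attribute that separates the classes perfectly recovers the full impurity of the parent set as its goodness-of-split.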
5.1.3
Information Gain
Information Gain is an impurity-based criterion that uses entropy (a measure originating in information theory) as the impurity measure.
$$\text{InformationGain}(a_i, S) = \text{Entropy}(y, S) - \sum_{v_{i,j} \in dom(a_i)} \frac{|\sigma_{a_i = v_{i,j}} S|}{|S|} \cdot \text{Entropy}(y, \sigma_{a_i = v_{i,j}} S), \qquad (5.3)$$
where:

$$\text{Entropy}(y, S) = \sum_{c_j \in dom(y)} -\frac{|\sigma_{y=c_j} S|}{|S|} \cdot \log_2 \frac{|\sigma_{y=c_j} S|}{|S|}. \qquad (5.4)$$
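Equations (5.3) and (5.4) translate directly into code. This is a minimal sketch assuming records are dicts mapping attribute names to values (an illustrative representation; the names are not from the text).

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy(y, S) from Eq. (5.4), in bits."""
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(rows, attr, target):
    """InformationGain(a_i, S) from Eq. (5.3): Eq. (5.2) instantiated
    with entropy as the impurity measure."""
    n = len(rows)
    gain = entropy(rows, target)
    for v in {r[attr] for r in rows}:                  # v_{i,j} in dom(a_i)
        subset = [r for r in rows if r[attr] == v]     # sigma_{a_i = v_{i,j}} S
        gain -= len(subset) / n * entropy(subset, target)
    return gain
```

On a two-class sample split evenly, the parent entropy is 1 bit; an attribute whose values separate the classes perfectly yields an information gain of exactly 1 bit.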
Information gain is closely related to maximum likelihood estimation (MLE), a popular statistical method used to make inferences about the parameters of the underlying probability distribution from a given dataset.
5.1.4
Gini Index
The Gini index is an impurity-based criterion that measures the divergence between the probability distributions of the target attribute's values. The