It should be noted that if the probability vector has a component of 1
(the variable x takes only one value), then the variable is defined as pure.
On the other hand, if all components are equal, the level of impurity reaches
its maximum.
Given a training set S, the probability vector of the target attribute y
is defined as:

\[
P_y(S) = \left( \frac{|\sigma_{y=c_1} S|}{|S|},\ \ldots,\ \frac{|\sigma_{y=c_{|\mathrm{dom}(y)|}} S|}{|S|} \right) \tag{5.1}
\]
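The probability vector of Eq. (5.1) is simply the relative frequency of each class value in S. A minimal sketch in Python (the function name and toy sample are our own, not from the text):

```python
from collections import Counter

def probability_vector(labels):
    """P_y(S) of Eq. (5.1): the relative frequency of each class value in S."""
    total = len(labels)
    return {c: n / total for c, n in Counter(labels).items()}

# Hypothetical toy sample: three "yes" labels and one "no".
S = ["yes", "yes", "no", "yes"]
print(probability_vector(S))  # {'yes': 0.75, 'no': 0.25}
```

A pure variable yields a vector with a single component equal to 1, while a uniform label distribution maximizes impurity, matching the remark above.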
The goodness-of-split due to a discrete attribute a_i is defined as the
reduction in impurity of the target attribute after partitioning S according
to the values v_{i,j} ∈ dom(a_i):

\[
\Delta\Phi(a_i, S) = \phi(P_y(S)) - \sum_{j=1}^{|\mathrm{dom}(a_i)|} \frac{|\sigma_{a_i=v_{i,j}} S|}{|S|} \cdot \phi\!\left(P_y(\sigma_{a_i=v_{i,j}} S)\right) \tag{5.2}
\]
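Eq. (5.2) works for any impurity measure φ: partition S by the attribute's values, then subtract the size-weighted impurity of the partitions from the impurity of S. A sketch, with the Gini impurity 1 − Σp² used as a concrete φ (the function names and toy data are illustrative assumptions, not from the text):

```python
from collections import Counter

def probability_vector(labels):
    """P_y(S) as a list of class probabilities."""
    return [n / len(labels) for n in Counter(labels).values()]

def goodness_of_split(records, attr_index, label_index, phi):
    """ΔΦ(a_i, S) of Eq. (5.2): impurity of S minus the
    size-weighted impurity of its partitions by attribute a_i."""
    labels = [r[label_index] for r in records]
    before = phi(probability_vector(labels))
    # Partition S by the values v_{i,j} of attribute a_i.
    partitions = {}
    for r in records:
        partitions.setdefault(r[attr_index], []).append(r[label_index])
    weighted = sum(
        len(part) / len(records) * phi(probability_vector(part))
        for part in partitions.values()
    )
    return before - weighted

# Gini impurity as one possible phi (hypothetical toy data).
gini = lambda p: 1 - sum(x * x for x in p)
S = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"), ("rain", "yes")]
print(goodness_of_split(S, 0, 1, gini))  # 0.5: a perfect split removes all impurity
```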
5.1.3 Information Gain

Information Gain is an impurity-based criterion that uses the entropy
measure (originating from information theory) as the impurity measure.
\[
\mathrm{InformationGain}(a_i, S) = \mathrm{Entropy}(y, S) - \sum_{v_{i,j} \in \mathrm{dom}(a_i)} \frac{|\sigma_{a_i=v_{i,j}} S|}{|S|} \cdot \mathrm{Entropy}(y, \sigma_{a_i=v_{i,j}} S) \tag{5.3}
\]
where:

\[
\mathrm{Entropy}(y, S) = \sum_{c_j \in \mathrm{dom}(y)} -\frac{|\sigma_{y=c_j} S|}{|S|} \cdot \log_2 \frac{|\sigma_{y=c_j} S|}{|S|} \tag{5.4}
\]
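Eqs. (5.3) and (5.4) can be combined in a short sketch: compute the entropy of the target labels, then subtract the size-weighted entropy of each partition induced by the splitting attribute (function names and toy data are our own illustrations):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(y, S) of Eq. (5.4)."""
    total = len(labels)
    return -sum(
        (n / total) * log2(n / total) for n in Counter(labels).values()
    )

def information_gain(records, attr_index, label_index):
    """InformationGain(a_i, S) of Eq. (5.3): Eq. (5.2) with entropy as phi."""
    labels = [r[label_index] for r in records]
    before = entropy(labels)
    partitions = {}
    for r in records:
        partitions.setdefault(r[attr_index], []).append(r[label_index])
    weighted = sum(
        len(p) / len(records) * entropy(p) for p in partitions.values()
    )
    return before - weighted

# Hypothetical toy data: splitting on the first attribute leaves one
# pure partition ("hot") and one mixed partition ("cool").
S = [("hot", "no"), ("hot", "no"), ("cool", "yes"), ("cool", "no")]
print(information_gain(S, 0, 1))
```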
Information gain is closely related to maximum likelihood estimation
(MLE), a popular statistical method for making inferences about the
parameters of an underlying probability distribution from a given
dataset.
5.1.4 Gini Index

The Gini index is an impurity-based criterion that measures the divergence
between the probability distributions of the target attribute's values. The