It should be noted that if the probability vector has a component of 1
(the variable x takes only one value), then the variable is defined as pure.
On the other hand, if all components are equal, the level of impurity reaches
its maximum.
Given a training set S, the probability vector of the target attribute y
is defined as:

\[
P_y(S) = \left( \frac{|\sigma_{y=c_1} S|}{|S|},\ \ldots,\ \frac{|\sigma_{y=c_{|\mathrm{dom}(y)|}} S|}{|S|} \right) \tag{5.1}
\]
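The probability vector of Eq. (5.1) is simply the relative frequency of each class value in S. A minimal sketch in Python (the function name and toy sample are our own, not from the text):

```python
from collections import Counter

def probability_vector(labels):
    """P_y(S) of Eq. (5.1): the relative frequency of each class value in S."""
    total = len(labels)
    return {c: n / total for c, n in Counter(labels).items()}

# Hypothetical toy sample: three "yes" labels and one "no".
S = ["yes", "yes", "no", "yes"]
print(probability_vector(S))  # {'yes': 0.75, 'no': 0.25}
```

A pure variable yields a vector with a single component equal to 1, while a uniform label distribution maximizes impurity, matching the remark above.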
The goodness-of-split due to a discrete attribute a_i is defined as the
reduction in impurity of the target attribute after partitioning S according
to the values v_{i,j} ∈ dom(a_i):

\[
\Delta\Phi(a_i, S) = \phi(P_y(S)) - \sum_{j=1}^{|\mathrm{dom}(a_i)|} \frac{|\sigma_{a_i=v_{i,j}} S|}{|S|} \cdot \phi\!\left(P_y(\sigma_{a_i=v_{i,j}} S)\right) \tag{5.2}
\]
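Eq. (5.2) works for any impurity measure φ: partition S by the attribute's values, then subtract the size-weighted impurity of the partitions from the impurity of S. A sketch, with the Gini impurity 1 − Σp² used as a concrete φ (the function names and toy data are illustrative assumptions, not from the text):

```python
from collections import Counter

def probability_vector(labels):
    """P_y(S) as a list of class probabilities."""
    return [n / len(labels) for n in Counter(labels).values()]

def goodness_of_split(records, attr_index, label_index, phi):
    """ΔΦ(a_i, S) of Eq. (5.2): impurity of S minus the
    size-weighted impurity of its partitions by attribute a_i."""
    labels = [r[label_index] for r in records]
    before = phi(probability_vector(labels))
    # Partition S by the values v_{i,j} of attribute a_i.
    partitions = {}
    for r in records:
        partitions.setdefault(r[attr_index], []).append(r[label_index])
    weighted = sum(
        len(part) / len(records) * phi(probability_vector(part))
        for part in partitions.values()
    )
    return before - weighted

# Gini impurity as one possible phi (hypothetical toy data).
gini = lambda p: 1 - sum(x * x for x in p)
S = [("sunny", "no"), ("sunny", "no"), ("rain", "yes"), ("rain", "yes")]
print(goodness_of_split(S, 0, 1, gini))  # 0.5: a perfect split removes all impurity
```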
5.1.3 Information Gain

Information Gain is an impurity-based criterion that uses the entropy
measure (originating from information theory) as the impurity measure.
\[
\mathrm{InformationGain}(a_i, S) = \mathrm{Entropy}(y, S) - \sum_{v_{i,j} \in \mathrm{dom}(a_i)} \frac{|\sigma_{a_i=v_{i,j}} S|}{|S|} \cdot \mathrm{Entropy}(y, \sigma_{a_i=v_{i,j}} S) \tag{5.3}
\]
where:

\[
\mathrm{Entropy}(y, S) = \sum_{c_j \in \mathrm{dom}(y)} -\frac{|\sigma_{y=c_j} S|}{|S|} \cdot \log_2 \frac{|\sigma_{y=c_j} S|}{|S|} \tag{5.4}
\]
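Eqs. (5.3) and (5.4) can be combined in a short sketch: compute the entropy of the target labels, then subtract the size-weighted entropy of each partition induced by the splitting attribute (function names and toy data are our own illustrations):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(y, S) of Eq. (5.4)."""
    total = len(labels)
    return -sum(
        (n / total) * log2(n / total) for n in Counter(labels).values()
    )

def information_gain(records, attr_index, label_index):
    """InformationGain(a_i, S) of Eq. (5.3): Eq. (5.2) with entropy as phi."""
    labels = [r[label_index] for r in records]
    before = entropy(labels)
    partitions = {}
    for r in records:
        partitions.setdefault(r[attr_index], []).append(r[label_index])
    weighted = sum(
        len(p) / len(records) * entropy(p) for p in partitions.values()
    )
    return before - weighted

# Hypothetical toy data: splitting on the first attribute leaves one
# pure partition ("hot") and one mixed partition ("cool").
S = [("hot", "no"), ("hot", "no"), ("cool", "yes"), ("cool", "no")]
print(information_gain(S, 0, 1))
```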
Information gain is closely related to maximum likelihood estimation
(MLE), a popular statistical method for making inferences about the
parameters of an underlying probability distribution from a given
dataset.
5.1.4 Gini Index

The Gini index is an impurity-based criterion that measures the divergence
between the probability distributions of the target attribute's values. The