of the comparisons are based on empirical results, although there are some
theoretical conclusions.
Most researchers point out that in nearly all cases the choice of
splitting criterion makes little difference to tree performance. As the
no-free-lunch theorem suggests, each criterion is superior in some cases
and inferior in others.
5.2 Handling Missing Values
Missing values are common in real-world datasets. They can complicate
both induction (when some values in the training set are missing) and the
classification of a new instance that is missing certain values.
The problem of missing values has been addressed by several researchers
such as [Friedman (1977)], [Breiman et al. (1984)] and [Quinlan (1989)].
Friedman (1977) suggests handling missing values in the training set in the
following way. Let $\sigma_{a_i=?}S$ indicate the subset of instances in $S$
whose $a_i$ values are missing. When calculating the splitting criterion
using attribute $a_i$, simply ignore all instances whose values in attribute
$a_i$ are unknown. Instead of using the splitting criterion
$\Delta\Phi(a_i, S)$, we use $\Delta\Phi(a_i, S - \sigma_{a_i=?}S)$.
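As an illustration of this "ignore the unknowns" rule, here is a minimal Python sketch. It assumes information gain as the splitting criterion and `None` as the marker for a missing value; both choices, along with the function names, are assumptions made for this example and do not come from the text.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain_known_only(values, labels):
    """Information gain of an attribute, computed only on the instances
    whose attribute value is known (None marks a missing value)."""
    known = [(v, y) for v, y in zip(values, labels) if v is not None]
    if not known:
        return 0.0
    # Group the class labels by attribute value.
    groups = {}
    for v, y in known:
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / len(known) * entropy(g) for g in groups.values())
    return entropy([y for _, y in known]) - remainder
```

On a toy set where the four known instances are split perfectly by the attribute, the two instances with missing values play no part and the gain is 1 bit.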
On the other hand, Quinlan (1989) argues that in the case of missing values,
the splitting criterion should be reduced proportionally, as nothing has been
learned from these instances. In other words, instead of using the splitting
criterion $\Delta\Phi(a_i, S)$, we use the following correction:
$$\frac{|S - \sigma_{a_i=?}S|}{|S|}\,\Delta\Phi(a_i, S - \sigma_{a_i=?}S). \tag{5.15}$$
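The proportional correction can be sketched as a thin wrapper around any splitting criterion. The function name `corrected_criterion`, the `criterion` callback signature, and the use of `None` for missing values are assumptions made for this illustration.

```python
def corrected_criterion(values, labels, criterion):
    """Scale a splitting criterion by the fraction of instances whose
    attribute value is known (None marks a missing value).

    `criterion(known_values, known_labels)` is any function that scores
    a split on the instances with known values.
    """
    if not values:
        return 0.0
    known = [(v, y) for v, y in zip(values, labels) if v is not None]
    known_values = [v for v, _ in known]
    known_labels = [y for _, y in known]
    fraction_known = len(known) / len(values)
    return fraction_known * criterion(known_values, known_labels)
```

With four known values out of six, a criterion that would score 1.0 on the known subset is reduced to 4/6, so attributes with many unknown values are penalized when competing for the split.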
In cases where the criterion value is normalized (as in the case of
gain ratio), the denominator should be calculated as if the missing values
represent an additional value in the attribute domain. For instance, the
gain ratio with missing values should be calculated as follows:
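$$GainRatio(a_i, S) = \frac{\dfrac{|S - \sigma_{a_i=?}S|}{|S|}\, InformationGain(a_i, S - \sigma_{a_i=?}S)}{-\dfrac{|\sigma_{a_i=?}S|}{|S|}\log\left(\dfrac{|\sigma_{a_i=?}S|}{|S|}\right) - \sum\limits_{v_{i,j} \in dom(a_i)} \dfrac{|\sigma_{a_i=v_{i,j}}S|}{|S|}\log\left(\dfrac{|\sigma_{a_i=v_{i,j}}S|}{|S|}\right)}. \tag{5.16}$$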
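The gain ratio with missing values can be sketched in Python as follows. The sketch assumes base-2 logarithms (the text does not fix the base), uses `None` to mark a missing value, and returns 0 when the denominator vanishes; these conventions and all names are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy_from_counts(counts):
    """Shannon entropy (in bits) of a distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


def gain_ratio_with_missing(values, labels):
    """Gain ratio where the numerator is the information gain on the
    known subset, scaled by the fraction known, and the denominator
    treats 'missing' as one extra attribute value."""
    n = len(values)
    known = [(v, y) for v, y in zip(values, labels) if v is not None]
    if n == 0 or not known:
        return 0.0
    by_value = {}
    for v, y in known:
        by_value.setdefault(v, []).append(y)
    # Numerator: corrected information gain on the known instances.
    h_known = entropy_from_counts(Counter(y for _, y in known).values())
    remainder = sum(
        len(ys) / len(known) * entropy_from_counts(Counter(ys).values())
        for ys in by_value.values()
    )
    numerator = len(known) / n * (h_known - remainder)
    # Denominator: split information over known values plus "missing".
    split_counts = [len(ys) for ys in by_value.values()] + [n - len(known)]
    split_info = entropy_from_counts(split_counts)
    return numerator / split_info if split_info > 0 else 0.0
```

For six instances with two `a`, two `b` and two missing values, where the known instances are perfectly separated, the numerator is (4/6)·1 and the denominator is log2(3), the entropy of three equally sized groups.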
Once a node is split, Quinlan (1989) suggests adding $\sigma_{a_i=?}S$ to each
one of the outgoing edges with the following corresponding weight:
$$\frac{|\sigma_{a_i=v_{i,j}}S|}{|S - \sigma_{a_i=?}S|}. \tag{5.17}$$
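One way to picture this weighting is the sketch below: every instance with a known value goes down its own branch with weight 1, while every instance with a missing value is copied to all branches, weighted by the share of known instances in that branch. The function names, the `(index, weight)` representation, and the `None` marker are assumptions made for this example.

```python
from collections import Counter


def branch_weights(values):
    """Weight of each outgoing edge: the share of known-valued
    instances that take that edge (None marks a missing value)."""
    known = [v for v in values if v is not None]
    counts = Counter(known)
    return {v: c / len(known) for v, c in counts.items()}


def split_with_fractional_instances(values):
    """Map each branch to a list of (instance_index, weight) pairs,
    sending missing-value instances down every branch fractionally."""
    weights = branch_weights(values)
    branches = {v: [] for v in weights}
    for i, v in enumerate(values):
        if v is None:
            for branch, w in weights.items():
                branches[branch].append((i, w))
        else:
            branches[v].append((i, 1.0))
    return branches
```

With values `["a", "a", "a", "b", None]`, the last instance is added to branch `a` with weight 0.75 and to branch `b` with weight 0.25, so the fractional weights of a missing instance always sum to 1.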