The same idea is used for classifying a new instance with missing
attribute values. When an instance reaches a node whose splitting
criterion cannot be evaluated because of a missing attribute value, it is passed down
all outgoing edges. The predicted class is the class with the highest
probability in the weighted union of all the leaf nodes at which this instance
ends up.
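This strategy can be sketched as follows. The tree structure, attribute names, and edge weights below are hypothetical; the point is that a missing value sends the instance down every branch, weighted by the fraction of training instances that took each branch, and the weighted leaf distributions are then summed.

```python
# A minimal sketch (not the book's code): classify an instance with a
# missing attribute by passing it down every outgoing edge with a weight
# proportional to the training instances that took that edge, then summing
# the weighted class distributions of the leaves it reaches.

def classify(node, instance, weight=1.0):
    """Return a dict mapping class label -> accumulated weight."""
    if node["leaf"]:
        # Leaf: contribute its class distribution scaled by the path weight.
        return {c: weight * p for c, p in node["dist"].items()}
    value = instance.get(node["attr"])  # None if the value is missing
    totals = {}
    if value is None:
        # Missing value: follow all edges, weighted by training frequency.
        for edge_value, (child, freq) in node["children"].items():
            for c, w in classify(child, instance, weight * freq).items():
                totals[c] = totals.get(c, 0.0) + w
    else:
        child, _ = node["children"][value]
        totals = classify(child, instance, weight)
    return totals

# Hypothetical tree: split on "outlook"; 60% of training cases went "sunny".
tree = {
    "leaf": False, "attr": "outlook",
    "children": {
        "sunny": ({"leaf": True, "dist": {"yes": 0.2, "no": 0.8}}, 0.6),
        "rain":  ({"leaf": True, "dist": {"yes": 0.9, "no": 0.1}}, 0.4),
    },
}
dist = classify(tree, {})          # "outlook" is missing in this instance
predicted = max(dist, key=dist.get)
```

Here the instance reaches both leaves, contributing 0.6 and 0.4 of its weight respectively, and the class with the larger weighted sum is predicted.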
Another approach known as surrogate splits is implemented in the
CART algorithm. The idea is to find for each split in the tree a surrogate
split which uses a different input attribute and which most resembles the
original split. If the value of the input attribute used in the original split
is missing, then it is possible to use the surrogate split. The resemblance
between two binary splits over sample S is formally defined as:
$$\mathrm{res}\left(a_i, dom_1(a_i), dom_2(a_i), a_j, dom_1(a_j), dom_2(a_j), S\right) = \frac{\left|\sigma_{a_i \in dom_1(a_i)\ \mathrm{AND}\ a_j \in dom_1(a_j)} S\right| + \left|\sigma_{a_i \in dom_2(a_i)\ \mathrm{AND}\ a_j \in dom_2(a_j)} S\right|}{|S|} \tag{5.18}$$
where the first split refers to attribute $a_i$ and splits $dom(a_i)$ into $dom_1(a_i)$
and $dom_2(a_i)$. The alternative split refers to attribute $a_j$ and splits its
domain into $dom_1(a_j)$ and $dom_2(a_j)$.
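Equation (5.18) simply counts the fraction of instances in $S$ that both binary splits route to the same side. A minimal sketch, with a hypothetical sample and attribute names:

```python
# A minimal sketch of Eq. (5.18): the resemblance of a candidate surrogate
# split on a_j to the original split on a_i is the fraction of instances in S
# that both splits send to the same side.

def resemblance(sample, a_i, dom1_i, a_j, dom1_j):
    """Fraction of instances routed the same way by both binary splits.

    dom1_i / dom1_j are the value sets defining the 'left' side of each
    split; every other value goes to the 'right' side (dom2).
    """
    agree = sum(
        1 for inst in sample
        if (inst[a_i] in dom1_i) == (inst[a_j] in dom1_j)
    )
    return agree / len(sample)

# Hypothetical sample: original split on "color", surrogate on "size".
S = [
    {"color": "red",  "size": "small"},
    {"color": "red",  "size": "small"},
    {"color": "blue", "size": "large"},
    {"color": "blue", "size": "small"},
]
res = resemblance(S, "color", {"red"}, "size", {"small"})  # 3 of 4 agree
```

CART computes this score for every candidate surrogate and keeps the attribute whose split agrees with the original one most often.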
The missing value can also be estimated from the other instances. In the
learning phase, if the value of a nominal attribute $a_i$ in tuple $q$ is missing,
then it is estimated by its mode over all instances having the same target
attribute value. Formally,
$$\mathrm{estimate}(a_i, y_q, S) = \underset{v_{i,j} \in dom(a_i)}{\operatorname{argmax}} \left|\sigma_{a_i = v_{i,j}\ \mathrm{AND}\ y = y_q} S\right| \tag{5.19}$$
where $y_q$ denotes the value of the target attribute in the tuple $q$. If the
missing attribute $a_i$ is numeric, then, instead of the mode of $a_i$, it is
more appropriate to use its mean.
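This estimation rule can be sketched directly from Eq. (5.19). The sample and attribute names below are hypothetical; the mode is used for a nominal attribute and the mean for a numeric one, in both cases restricted to instances sharing the tuple's class label:

```python
# A minimal sketch of Eq. (5.19): impute a missing attribute value from the
# training instances whose class label equals that of the query tuple.
from collections import Counter
from statistics import mean

def estimate(attr, target_value, sample, numeric=False):
    """Mode (nominal) or mean (numeric) of attr over instances of one class."""
    values = [inst[attr] for inst in sample
              if inst["class"] == target_value and inst.get(attr) is not None]
    if numeric:
        return mean(values)
    return Counter(values).most_common(1)[0][0]  # the mode

# Hypothetical training sample.
S = [
    {"humidity": 70, "wind": "weak",   "class": "yes"},
    {"humidity": 80, "wind": "weak",   "class": "yes"},
    {"humidity": 90, "wind": "strong", "class": "no"},
]
est_nominal = estimate("wind", "yes", S)                    # mode over class "yes"
est_numeric = estimate("humidity", "yes", S, numeric=True)  # mean over class "yes"
```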