the dependencies between attributes to predict missing attribute values. We are
therefore interested in approaches that use decision trees to fill in missing
values [16, 21]: if a decision tree is able to determine the class of an instance
from the values of its attributes, then it can equally be used to determine the
value of an unknown attribute from its dependent attributes. We have extended
the Ordered Attribute Trees (OAT) method proposed by Lobo and Numao [16],
who use decision trees to fill in missing values in both the training data and the
test data. Our approach uses decision trees to handle missing values during the
classification phase, and the result of classification is a probability distribution
over the class values.
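As an illustration of this idea, the sketch below predicts a missing attribute from one dependent attribute using a one-level decision "stump" and returns a probability distribution over the attribute's values. This is a deliberately simplified stand-in for a full decision tree, not the authors' actual algorithm; the attribute names and data are hypothetical, and missing values are represented by `None`.

```python
from collections import Counter

def stump_impute(instances, target_attr, split_attr):
    """Predict a probability distribution over the values of `target_attr`
    (where it is missing) from `split_attr`, using a one-level decision
    stump -- a simplified stand-in for a full decision tree."""
    # Group the known values of the target attribute by the value
    # of the splitting attribute.
    groups = {}
    for inst in instances:
        if inst[target_attr] is not None:
            groups.setdefault(inst[split_attr], Counter())[inst[target_attr]] += 1

    def predict(inst):
        counts = groups.get(inst[split_attr])
        if not counts:  # unseen branch: fall back to the overall prior
            counts = Counter(i[target_attr] for i in instances
                             if i[target_attr] is not None)
        total = sum(counts.values())
        return {v: n / total for v, n in counts.items()}

    return predict

# Usage: 'humidity' is missing in one instance; predict it from 'outlook'.
data = [
    {"outlook": "sunny", "humidity": "high"},
    {"outlook": "sunny", "humidity": "high"},
    {"outlook": "rain",  "humidity": "normal"},
    {"outlook": "sunny", "humidity": None},
]
predict = stump_impute(data, "humidity", "outlook")
print(predict(data[3]))  # → {'high': 1.0}
```

A full tree would split on several dependent attributes in turn, but the output has the same shape: a distribution over candidate values rather than a single hard guess.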
In our experiments, we tested our approach on several databases [20] in order
to measure and evaluate the quality of our classification results. For this purpose,
we compared each instance in the test data with all the instances in the training
data by calculating the distance between them; to calculate the distance between
two instances, our approach relies on the Relief algorithm [12] and its extension
ReliefF [14, 24]. For each test instance, we compute the frequency of each class
among its nearest training instances, and we compare this frequency with the
classification results obtained for the same test instance by our approach and by
the C4.5 method. Relief, developed by [12], and its extension ReliefF [14, 24] are
attribute estimation measures for classification: they take the context of the
other attributes into account when estimating the quality of an attribute with
respect to the class.
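The distance computation just described can be sketched as follows: a Relief-style per-attribute difference (0/1 for nominal attributes, absolute difference for numeric ones, here assumed pre-normalized to [0, 1]), summed over attributes, and then used to compute the class frequencies among the k nearest training instances. Function names and the toy data are hypothetical.

```python
from collections import Counter

def diff(a, b, numeric):
    """Relief-style per-attribute difference: 0/1 for nominal values,
    absolute difference for numeric values normalized to [0, 1]."""
    return abs(a - b) if numeric else (0.0 if a == b else 1.0)

def distance(x, y, attr_types):
    """Distance between two instances: sum of per-attribute diffs,
    as used by Relief/ReliefF to rank nearest hits and misses."""
    return sum(diff(x[i], y[i], t) for i, t in enumerate(attr_types))

def nearest_class_freq(test_inst, training, labels, attr_types, k=3):
    """Class frequencies among the k training instances nearest to
    `test_inst` -- the reference against which classification
    results can be compared."""
    ranked = sorted(range(len(training)),
                    key=lambda i: distance(test_inst, training[i], attr_types))
    counts = Counter(labels[i] for i in ranked[:k])
    return {c: n / k for c, n in counts.items()}

# Usage: one numeric attribute (True) and one nominal attribute (False).
train = [(0.1, "red"), (0.9, "blue"), (0.2, "red")]
labels = ["A", "B", "A"]
types = [True, False]
print(nearest_class_freq((0.15, "red"), train, labels, types, k=2))
# → {'A': 1.0}
```

ReliefF refines this scheme with k nearest hits and misses per class and a probabilistic treatment of missing values, but the per-attribute `diff` above is the common core.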
In this chapter, we first present related work in the domain, in particular
Lobo's OAT approach and Quinlan's C4.5 method. We then describe our method
for estimating missing values, which uses the dependencies between attributes
and produces a probabilistic result. Next, we present the tests performed on
several databases with our approach, the OAT method and Quinlan's method,
and we measure the quality of our classification results. Finally, we analyze the
complexity of our method and present some perspectives.
4.1.1 Related Work
In this section we present the methods that use decision trees to deal with
missing values¹. The general idea in filling missing values is to infer them from
the other known data. Several approaches to dealing with missing values can be
distinguished. The simplest one is to ignore instances containing missing values
[15]. A second type of technique consists of replacing a missing value with a
value considered adequate in the situation. For example, [13] proposes a method
that uses class information to estimate missing attribute values during the
training phase; the idea is to assign to the missing value the most probable
value of the attribute, given the class membership of the case concerned. [22]
fills in the missing values of an attribute with its most common known value
in the training set during the
¹ The methods which deal with missing values in the statistical domain are not presented in this chapter [2].
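The simple strategies surveyed above can be sketched as follows. Attribute names and data are hypothetical, missing values are represented by `None`, and each function implements one of the cited strategies in its most basic form.

```python
from collections import Counter

def drop_incomplete(instances):
    """Strategy 1 [15]: ignore instances containing a missing value."""
    return [x for x in instances if None not in x.values()]

def most_common_value(instances, attr):
    """Strategy 2a [22]: replace with the attribute's most common
    known value in the training set."""
    counts = Counter(x[attr] for x in instances if x[attr] is not None)
    return counts.most_common(1)[0][0]

def most_common_given_class(instances, attr, cls):
    """Strategy 2b [13]: replace with the most probable value of the
    attribute, given the class membership of the case concerned."""
    counts = Counter(x[attr] for x in instances
                     if x[attr] is not None and x["class"] == cls)
    return counts.most_common(1)[0][0]

# Usage on a toy training set with one missing 'wind' value.
data = [
    {"wind": "weak",   "class": "yes"},
    {"wind": "strong", "class": "no"},
    {"wind": "weak",   "class": "yes"},
    {"wind": None,     "class": "yes"},
]
print(len(drop_incomplete(data)))                      # → 3
print(most_common_value(data, "wind"))                 # → weak
print(most_common_given_class(data, "wind", "yes"))    # → weak
```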
 