the dependencies between attributes to predict missing attribute values. We are
therefore interested in approaches that use decision trees to fill in missing
values [16, 21]: if a decision tree is able to determine the class of an instance
from the values of its attributes, then it can equally be used to determine the
value of an unknown attribute from its dependent attributes. We have extended
the Ordered Attribute Trees (OAT) method proposed by Lobo and Numao [16],
who use decision trees to fill in missing values in both the training data and the
test data. Our approach uses decision trees to handle missing values during the
classification phase, and the result of classification is a probability distribution
over the class values.
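As an illustration of this idea, the sketch below predicts a missing attribute from one dependent attribute using a one-level decision "stump" and returns a probability distribution over the attribute's values. This is a deliberately simplified stand-in for a full decision tree, not the authors' actual algorithm; the attribute names and data are hypothetical, and missing values are represented by `None`.

```python
from collections import Counter

def stump_impute(instances, target_attr, split_attr):
    """Predict a probability distribution over the values of `target_attr`
    (where it is missing) from `split_attr`, using a one-level decision
    stump -- a simplified stand-in for a full decision tree."""
    # Group the known values of the target attribute by the value
    # of the splitting attribute.
    groups = {}
    for inst in instances:
        if inst[target_attr] is not None:
            groups.setdefault(inst[split_attr], Counter())[inst[target_attr]] += 1

    def predict(inst):
        counts = groups.get(inst[split_attr])
        if not counts:  # unseen branch: fall back to the overall prior
            counts = Counter(i[target_attr] for i in instances
                             if i[target_attr] is not None)
        total = sum(counts.values())
        return {v: n / total for v, n in counts.items()}

    return predict

# Usage: 'humidity' is missing in one instance; predict it from 'outlook'.
data = [
    {"outlook": "sunny", "humidity": "high"},
    {"outlook": "sunny", "humidity": "high"},
    {"outlook": "rain",  "humidity": "normal"},
    {"outlook": "sunny", "humidity": None},
]
predict = stump_impute(data, "humidity", "outlook")
print(predict(data[3]))  # → {'high': 1.0}
```

A full tree would split on several dependent attributes in turn, but the output has the same shape: a distribution over candidate values rather than a single hard guess.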
In our experiments, we tested our approach on several databases [20] in order
to measure and evaluate the quality of our classification results. For this purpose,
we compared each instance in the test data with all the instances in the training
data by calculating the distance between them; to calculate the distance between
two instances, our approach relies on the Relief algorithm [12] and its extension
ReliefF [14, 24]. For each test instance, we compute the frequency of each class
among its nearest training instances, and we compare this frequency with the
classification results obtained for the same test instance by our approach and by
the C4.5 method. Relief, developed by [12], and its extension ReliefF [14, 24] are
attribute estimation measures for classification: they take the context of the
other attributes into account when estimating the quality of an attribute with
respect to the class.
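The distance computation just described can be sketched as follows: a Relief-style per-attribute difference (0/1 for nominal attributes, absolute difference for numeric ones, here assumed pre-normalized to [0, 1]), summed over attributes, and then used to compute the class frequencies among the k nearest training instances. Function names and the toy data are hypothetical.

```python
from collections import Counter

def diff(a, b, numeric):
    """Relief-style per-attribute difference: 0/1 for nominal values,
    absolute difference for numeric values normalized to [0, 1]."""
    return abs(a - b) if numeric else (0.0 if a == b else 1.0)

def distance(x, y, attr_types):
    """Distance between two instances: sum of per-attribute diffs,
    as used by Relief/ReliefF to rank nearest hits and misses."""
    return sum(diff(x[i], y[i], t) for i, t in enumerate(attr_types))

def nearest_class_freq(test_inst, training, labels, attr_types, k=3):
    """Class frequencies among the k training instances nearest to
    `test_inst` -- the reference against which classification
    results can be compared."""
    ranked = sorted(range(len(training)),
                    key=lambda i: distance(test_inst, training[i], attr_types))
    counts = Counter(labels[i] for i in ranked[:k])
    return {c: n / k for c, n in counts.items()}

# Usage: one numeric attribute (True) and one nominal attribute (False).
train = [(0.1, "red"), (0.9, "blue"), (0.2, "red")]
labels = ["A", "B", "A"]
types = [True, False]
print(nearest_class_freq((0.15, "red"), train, labels, types, k=2))
# → {'A': 1.0}
```

ReliefF refines this scheme with k nearest hits and misses per class and a probabilistic treatment of missing values, but the per-attribute `diff` above is the common core.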
In this chapter, we first present related work in the domain, in particular
Lobo's OAT approach and Quinlan's C4.5 method. We then describe our method
for estimating missing values, which uses the dependencies between attributes
and produces a probabilistic result. Next, we present the tests performed on
several databases with our approach, the OAT method and Quinlan's method,
and we measure the quality of our classification results. Finally, we analyze the
complexity of our method and present some perspectives.
4.1.1 Related Work
In this section we present the methods that use decision trees to deal with
missing values¹. The general idea in filling missing values is to infer them from
the other known data. Several approaches to dealing with missing values can be
distinguished. The simplest one is to ignore instances containing missing values
[15]. A second type of technique consists of replacing a missing value with a
value considered adequate in the situation. For example, [13] proposes a method
that uses class information to estimate missing attribute values during the
training phase; the idea is to assign to the missing value the most probable
value of the attribute, given the class membership of the case concerned. [22]
fills in the missing values of an attribute with its most common known value
in the training set during the
¹ The methods which deal with missing values in the statistical domain are not presented in this chapter [2].
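The simple strategies surveyed above can be sketched as follows. Attribute names and data are hypothetical, missing values are represented by `None`, and each function implements one of the cited strategies in its most basic form.

```python
from collections import Counter

def drop_incomplete(instances):
    """Strategy 1 [15]: ignore instances containing a missing value."""
    return [x for x in instances if None not in x.values()]

def most_common_value(instances, attr):
    """Strategy 2a [22]: replace with the attribute's most common
    known value in the training set."""
    counts = Counter(x[attr] for x in instances if x[attr] is not None)
    return counts.most_common(1)[0][0]

def most_common_given_class(instances, attr, cls):
    """Strategy 2b [13]: replace with the most probable value of the
    attribute, given the class membership of the case concerned."""
    counts = Counter(x[attr] for x in instances
                     if x[attr] is not None and x["class"] == cls)
    return counts.most_common(1)[0][0]

# Usage on a toy training set with one missing 'wind' value.
data = [
    {"wind": "weak",   "class": "yes"},
    {"wind": "strong", "class": "no"},
    {"wind": "weak",   "class": "yes"},
    {"wind": None,     "class": "yes"},
]
print(len(drop_incomplete(data)))                      # → 3
print(most_common_value(data, "wind"))                 # → weak
print(most_common_given_class(data, "wind", "yes"))    # → weak
```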
 