Dealing with Missing Values in a Probabilistic Decision Tree during Classification - Mining Complex Data

Information Technology Reference

In-Depth Information

4

Dealing with Missing Values in a Probabilistic

Decision Tree during Classification

Lamis Hawarah, Ana Simonet, and Michel Simonet

Institut d'Ingenierie et de l'Information de Sant´e(TIMC)

Facult´edeMedecine

38700 La Tronche, France

{ lamis.hawarah,ana.simonet,michel.simonet@imag.fr }

This chapter deals with the problem of missing values in decision trees during classifi-

cation. Our approach is derived from the ordered attribute trees method, proposed by

Lobo and Numao in 2000, which builds a decision tree for each attribute and uses these

trees to fill the missing attribute values. Our method takes into account the depen-

dence between attributes by using Mutual Information. The result of the classification

process is a probability distribution instead of a single class. In this chapter, we ex-

plain our approach, we then present tests performed of our approach on several real

databases and we compare them with those given by Lobo's method and Quinlan's

method. We also measure the quality of our classification results. Finally, we calculate

the complexity of our approach and we discuss some perspectives.

4.1

Introduction

In classification, the goal of a learning algorithm is to build a classifier from a

training set. Each example in such a training set is assigned to a class. Classi-

fication is the task of assigning objects to their respective categories. Decision

Trees are one of the most popular classification algorithms currently in use in

Data Mining and Machine Learning. Decision Trees belong to supervised clas-

sification methods. Once built, decision trees are used to classify new cases. A

case is classified by starting at the root node of the tree, testing the attribute

specified by this node, then moving through the tree until a leaf is encountered;

the case is classified by the class associated with the leaf. It may happen that

some objects have no value for some attributes. We can encounter this problem,

known as the problem of missing values, both during the construction phase and

the classification phase of a decision tree. In the latter situation, when classi-

fying an object, if the value of a particular attribute which was branched on

in the tree is missing in the object, it is not possible to decide which branch

to take in order to classify this object, and the classification process cannot be

completed.

Our objective is to classify an object with missing values. Our work is situ-

ated in the framework of probabilistic decision trees [1, 5, 23]. We aim at using

Search WWH ::

Custom Search

Home