Dealing with Missing Values in a Probabilistic Decision Tree during Classification - Mining Complex Data

Information Technology Reference

In-Depth Information

Complexity ( T rees )= O ( nbP ath Tree Log ( L Tree ) )

i = nbAT

+ O (

ComplexityTree ( Tree i ))

i =1

(4.11)

⎨

⎩

1 if no missing values

L Tree if all are missing

nbP T else

nbP ath Tree =

(4.12)

From equation 4.11, we remark that nbP ath Tree may be L Tree in the worst-case

when all the attributes are missing. Therefore, we can note that the complex-

ity O( nbP ath Tree Log v ( L Tree )) in this equation is quasilinear according to the

number of leaves in the final decision tree.

Consequently, the total complexity of classifying a new instance with missing

values becomes:

Complexity ( T rees )= O ( nbP ath Tree Log ( L Tree ) )

+ O ( m × v m log ( L T ))

This complexity is exponential in the number of missing attributes in the instance

to classify and quasilinear in the number of leaves in the final decision tree.

4.7

Conclusion and Perspectives

In this chapter, we have introduced a probabilistic approach to fill missing val-

ues in decision trees during classification. We proposed replacing an unknown

attribute with a probability distribution and taking into account the depen-

dence between attributes. We presented the results of tests performed on several

databases and we compared our results with those given by C4.5 and Lobo's ap-

proach for the same databases. We have observed that the results of our approach

are better than those obtained by OAT and C4.5. To measure the quality of our

classification results, we used an approach derived from Relief and its extensions

to calculate the distance between two instances with missing values. For each

instance in the test data, our Analysis-Instance algorithm gives the frequency

of its nearest instances from each class. We then compared the results obtained

by the Analysis-Instance algorithm with PAT and C4.5 results. We observed

that our classification results are closer to the Analysis-Instance algorithm re-

sults and better than those given by C4.5. The complexity of constructing our

attributes trees in the two approaches (PAOTs, PATs) and the complexity of

classifying a new instance with missing values using our approach were briefly

presented in this chapter. We found that the complexity of our classification al-

gorithm is exponential according to the number of missing attributes in the test

instance [10].

In the future, we aim at using a statistical test to calculate the dependence

between attributes instead of Mutual Information. We are also going to test our

Mining Complex Data

Search WWH ::

Custom Search

Home