Dealing with Missing Values in a Probabilistic Decision Tree during Classification - Mining Complex Data


Table 4.12. The Root Mean Squared Error

Database        PAT         C4.5        OAT
Vote            0.310443    0.52039     0.315079
Nursery         0.4456728   0.44999     0.436149
Lymphography    0.260477    0.477835    0.420603
Mushroom        0.412543    0.643147    0.535865
Zoo             0.133817    0.44812     0.245

4.5 Measure of the Quality of the Classification Results

Our approach is based on the dependence between attributes, and the classification results it gives are probabilistic. We measured the quality of our classification results in order to improve the performance of our approach. For this purpose, we considered an algorithm called Relief [12], which has been shown to be very efficient in estimating attributes. We were interested in Relief because it relies entirely on statistical analysis and employs few heuristics. The classical measures for classification (see footnote 8) evaluate the quality of an attribute with respect to the class independently of the context of other attributes [25]. Relief, in contrast, takes into account the context of other attributes when estimating the quality of an attribute with respect to the class. The basic idea of Relief, when analysing training instances, is to take into account not only the difference in attribute values and the difference in classes, but also the distance between instances. In this section, we first present the algorithm Relief, its extension ReliefF and the Distance function used to calculate the distance between two instances. We then propose an algorithm which calculates, for each test instance in the test data, the frequency of its nearest instances from each class. Finally, we give some examples.

4.5.1 Relief

The key idea of Relief is to estimate attributes according to how well their values distinguish among instances that are close to each other. For that purpose, given a randomly selected instance R from m instances, Relief [14] searches for its two nearest neighbors: one, H, from the same class and the other, M, from a different class. It uses a function diff that calculates the difference between the values of an attribute A for two instances. For a discrete attribute this difference is either 1 when the values are different or 0 when the values are equal. The quality estimate W[A] of attribute A is updated as shown below:

W[A] = W[A] − diff(A, R, H)/m + diff(A, R, M)/m    (4.3)
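The update rule (4.3) can be illustrated with a minimal Python sketch for discrete attributes. It is only an illustration under stated assumptions, not the authors' implementation: the function names, the fixed random seed, and in particular the distance function (here the number of attributes on which two instances differ) are assumptions, since the Distance function is not defined in this excerpt.

```python
import random


def diff(a, x, y):
    """Difference on discrete attribute index a: 1 if the values differ, 0 if equal."""
    return 0 if x[a] == y[a] else 1


def distance(x, y):
    """Assumed distance: count of attributes on which the two instances differ."""
    return sum(diff(a, x, y) for a in range(len(x)))


def relief(instances, labels, m, seed=0):
    """Return one quality estimate W[A] per attribute after m sampled instances."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    n_attrs = len(instances[0])
    w = [0.0] * n_attrs
    for _ in range(m):
        i = rng.randrange(len(instances))
        r, cls = instances[i], labels[i]
        # Candidates for the nearest hit H (same class, excluding R itself)
        hits = [x for j, x in enumerate(instances) if j != i and labels[j] == cls]
        # Candidates for the nearest miss M (different class)
        misses = [x for j, x in enumerate(instances) if labels[j] != cls]
        if not hits or not misses:
            continue  # no neighbor of one kind: skip this sample
        h = min(hits, key=lambda x: distance(r, x))
        miss = min(misses, key=lambda x: distance(r, x))
        for a in range(n_attrs):
            # Update rule (4.3)
            w[a] += -diff(a, r, h) / m + diff(a, r, miss) / m
    return w


# Toy data: attribute 0 determines the class, attribute 1 does not.
toy_instances = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
toy_labels = [0, 0, 1, 1]
toy_weights = relief(toy_instances, toy_labels, m=4)
```

On this toy data the first attribute receives a positive estimate and the second a negative one, since every sampled instance differs from its nearest hit only on the second attribute and from its nearest miss only on the first.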

Relief updates the quality estimation W[A] for all the attributes A depending on their values for R, M and H. This is repeated m times according to the m

8 Such as information gain, gain ratio, distance measure and Gini index.
