methods estimate the MVs using such relationships and can afford improvements in
the performance of the classifiers. Hence the highest values of the average MI ratio
could be related to those methods which obtain better estimates for the MVs while
maintaining the degree of relationship between the class labels and the isolated input
attributes. It is interesting to note that, when analyzing the MI ratio, the values do
not appear to be as highly data dependent as Wilson's noise ratio, as the values for
all the data sets are more or less close to each other.
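As a rough illustration of the quantity involved, the mutual information between one discrete attribute and the class labels can be estimated from empirical frequencies. The following is a minimal sketch with made-up toy data; the function name is ours, not from the chapter, and the per-attribute MI ratio discussed above would then be the MI computed on the imputed data divided by the MI on the original data:

```python
from collections import Counter
from math import log2

def mutual_information(attribute, labels):
    """Empirical MI between one discrete attribute and the class labels."""
    n = len(attribute)
    p_x = Counter(attribute)
    p_y = Counter(labels)
    p_xy = Counter(zip(attribute, labels))
    mi = 0.0
    for (x, y), count in p_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), all estimated from counts
        mi += (count / n) * log2(count * n / (p_x[x] * p_y[y]))
    return mi

# A perfectly informative attribute attains MI equal to the class entropy:
print(mutual_information(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 1.0
```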
If we count the methods with the lowest Wilson's noise ratio in each data set
in Table 4.2, the CMC method is first, being the lowest 12 times, and the EC
method is second, being the lowest 9 times. If we count the methods with the
highest MI ratio in each data set, the EC method has the highest ratio for 7 data
sets and is therefore first, while the CMC method has the highest ratio for 5 data
sets and is second in this case. Immediately the next question
arises: are these methods also the best for the performance of the learning methods
applied afterwards? We try to answer this question in the following.
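The tallying above can be sketched as a small counting routine; the score table below is illustrative toy data, not the values from Table 4.2, and the function name is our own:

```python
from collections import Counter

def count_wins(scores, best=min):
    """Count, per method, how often it attains the best value across data sets.

    scores: {data_set: {method: value}}; use best=min for Wilson's noise ratio
    (lower is better) and best=max for the MI ratio (higher is better).
    """
    wins = Counter()
    for per_method in scores.values():
        target = best(per_method.values())
        for method, value in per_method.items():
            if value == target:
                wins[method] += 1
    return wins

# Toy example (made-up values): CMC wins on "iris", EC wins on "wine".
noise = {"iris": {"CMC": 0.05, "EC": 0.07},
         "wine": {"CMC": 0.04, "EC": 0.03}}
wins = count_wins(noise, best=min)
```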
4.6.2 Best Imputation Methods for Classification Methods
Our aim is to use the imputed data sets from the previous Sect. 4.6.1 as
the input for a series of well-known classifiers in order to shed light on
the question "which is the best imputation method?". Let us consider a wide range
of classifiers grouped by their nature, as this will help us to limit the comparisons
that need to be made. We have grouped them into three sub-categories. In Table 4.3
we summarize the classification methods we have used, organized in these three
categories. The description of these categories is as follows:
The first group is the Rule Induction Learning category. This group refers to
algorithms which infer rules using different strategies.
The second group represents the Black Box Methods. It includes ANNs, SVMs
and statistical learning.
The third and last group corresponds to the Lazy Learning (LL) category. This
group incorporates methods which do not create any model, but use the training
data to perform the classification directly.
Some methods do not work with numerical attributes (CN2, AQ and Naïve-Bayes).
In order to discretize the numerical values, we have used the well-known discretizer
proposed by [ 28 ]. For the SVM methods (C-SVM,
ν-SVM and SMO), we have applied the usual preprocessing in the literature to these
methods [ 25 ]. This preprocessing consists of normalizing the numerical attributes
to the [0, 1] range, and binarizing the nominal attributes. Some of the classification
methods presented in the previous section have their own MVs treatment that will
trigger when no imputation is made (DNI): C4.5 uses a probabilistic approach to
handling MVs and CN2 applies the MC method by default in these cases. For ANNs,
[ 24 ] proposed to replace MVs with zero so as not to trigger the corresponding
neuron to which the MV is applied.
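The SVM preprocessing described above (scaling numeric attributes to the [0, 1] range and one-hot binarization of nominal ones) can be sketched per column as follows; the function names are ours and this is a minimal stand-in, not the exact preprocessing of [25]:

```python
def min_max_scale(column):
    """Rescale a numeric column to the [0, 1] range."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0          # guard against constant columns
    return [(v - lo) / span for v in column]

def binarize(column):
    """One-hot encode a nominal column, one indicator per distinct value."""
    values = sorted(set(column))
    return [[1 if v == u else 0 for u in values] for v in column]

print(min_max_scale([2.0, 4.0, 6.0]))    # [0.0, 0.5, 1.0]
print(binarize(["red", "blue", "red"]))  # [[0, 1], [1, 0], [0, 1]]
```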