Euclidean distance), and uses the class labels of such neighbors in order to classify
the considered instance. If the instance is not correctly classified, then the variable
noise is increased by one unit. Therefore, the final noise ratio will be
$$\text{Wilson's Noise} = \frac{noise}{\#\,\text{instances in the data set}}$$
After imputing a data set with different imputation methods, we can measure how disturbing the imputation method is for the classification task. Thus, by using Wilson's noise ratio we can observe which imputation methods reduce the impact of the MVs as noise, and which methods introduce noise when imputing.
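The procedure above can be sketched in Python as follows; the function name, the default `k`, and the majority-vote tie-breaking are assumptions for illustration, not details from the source:

```python
import numpy as np

def wilsons_noise_ratio(X, y, k=5):
    """Fraction of instances misclassified by their k nearest neighbours
    (Euclidean distance); a sketch of Wilson's noise ratio."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(X)
    noise = 0
    for i in range(n):
        # Euclidean distances from instance i to every instance
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the instance itself
        neigh = np.argsort(d)[:k]          # indices of the k nearest neighbours
        labels, counts = np.unique(y[neigh], return_counts=True)
        # if the neighbours' majority class disagrees, count one unit of noise
        if labels[np.argmax(counts)] != y[i]:
            noise += 1
    return noise / n
```

On a clean, well-separated data set the ratio is 0; each mislabeled instance surrounded by neighbours of another class adds `1/n`.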
Another approach is to use the mutual information (MI), which is considered to be a good indicator of relevance between two random variables [18]. Recently, the use of the MI measure in FS has become well known and has proven successful [51, 52, 66]. The computation of the MI measure for continuous attributes has been tackled in [51], allowing us to compute the MI measure not only on nominal-valued data sets.
In our approach, we calculate the MI between each input attribute and the class attribute, obtaining a set of values, one for each input attribute. In the next step we compute the ratio between each one of these values, considering the imputation of the data set with one imputation method with respect to the non-imputed data set. The average of these ratios will show us whether the imputation of the data set produces a gain in information:
$$\text{Avg. MI Ratio} = \frac{1}{|X|} \sum_{x_i \in X} \frac{MI_{\alpha}(x_i) + 1}{MI(x_i) + 1}$$
where $X$ is the set of input attributes, $MI_{\alpha}(x_i)$ represents the MI value of the $i$th input attribute in the imputed data set and $MI(x_i)$ is the MI value of the $i$th input attribute in the non-imputed data set. We have also applied the Laplace correction, summing 1 to both numerator and denominator, as an MI value of zero is possible for some input attributes.
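Given per-attribute MI values for the imputed and the non-imputed data set, the averaged Laplace-corrected ratio can be computed as below (the function name is hypothetical):

```python
def avg_mi_ratio(mi_imputed, mi_original):
    """Average of (MI_alpha(x_i) + 1) / (MI(x_i) + 1) over all input attributes.

    mi_imputed  -- MI values computed on the imputed data set
    mi_original -- MI values computed on the non-imputed data set
    """
    assert len(mi_imputed) == len(mi_original)
    # Laplace correction: add 1 to numerator and denominator so that
    # attributes with zero MI do not zero out or blow up the ratio
    ratios = [(a + 1.0) / (b + 1.0) for a, b in zip(mi_imputed, mi_original)]
    return sum(ratios) / len(ratios)
```

A ratio above 1 indicates that, on average, imputation produced a gain in information relative to the original data set.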
The calculation of $MI(x_i)$ depends on the type of attribute $x_i$. If the attribute $x_i$ is nominal, the MI between $x_i$ and the class label $Y$ is computed as follows:
$$MI_{nominal}(x_i) = I(x_i; Y) = \sum_{z \in x_i} \sum_{y \in Y} p(z, y) \log_2 \frac{p(z, y)}{p(z)\, p(y)}.$$
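For nominal attributes this amounts to plugging empirical frequencies into the double sum; a minimal sketch (the helper name is hypothetical, not from the source):

```python
import math
from collections import Counter

def mi_nominal(x, y):
    """I(x; Y) = sum over (z, y) of p(z, y) * log2( p(z, y) / (p(z) p(y)) ),
    with all probabilities estimated as empirical frequencies."""
    n = len(x)
    pz = Counter(x)           # counts of each value z of the attribute
    py = Counter(y)           # counts of each class label y
    pzy = Counter(zip(x, y))  # joint counts of (z, y) pairs
    mi = 0.0
    for (z, yv), c in pzy.items():
        p_joint = c / n
        # p(z,y) / (p(z) p(y)) = (c/n) / ((pz/n) * (py/n)) = c*n / (pz*py)
        mi += p_joint * math.log2(c * n / (pz[z] * py[yv]))
    return mi
```

For a binary attribute that determines the class exactly, this yields 1 bit; for an attribute independent of the class it yields 0.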
On the other hand, if the attribute x i is numeric, we have used the Parzen window
density estimate as shown in [ 51 ] considering a Gaussian window function:
$$MI_{numeric}(x_i) = I(x_i; Y) = H(Y) - H(Y \mid x_i).$$
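A rough sketch of this estimate follows: class-conditional densities $p(x \mid y)$ are estimated with a Gaussian Parzen window, the posteriors $p(y \mid x_i)$ are obtained via Bayes' rule, and $H(Y \mid x_i)$ is approximated by averaging the posterior entropy over the sample. This is a simplified stand-in for the estimator of [51]; the window width `h`, its default value, and the function name are assumptions, not details from the source:

```python
import math
import numpy as np

def mi_numeric(x, y, h=0.25):
    """Sketch of I(x; Y) = H(Y) - H(Y | x) for a numeric attribute,
    using Gaussian Parzen-window density estimates (h is an assumed width)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    n = len(x)
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / n
    # class entropy H(Y) from empirical class frequencies
    h_y = -np.sum(priors * np.log2(priors))

    def parzen(x0, pts):
        # Gaussian-kernel density estimate of p(x0) from the points pts
        return np.mean(np.exp(-0.5 * ((x0 - pts) / h) ** 2)) / (h * math.sqrt(2 * math.pi))

    # H(Y | x) approximated by the average entropy of the posterior p(y | x_i)
    h_y_given_x = 0.0
    for xi in x:
        dens = np.array([parzen(xi, x[y == c]) for c in classes])
        post = priors * dens
        post = post / post.sum()          # Bayes' rule: p(y | x_i)
        nz = post[post > 0]
        h_y_given_x -= np.sum(nz * np.log2(nz)) / n
    return h_y - h_y_given_x
```

When the two classes occupy well-separated ranges of the attribute, the estimate approaches $H(Y)$; when the class-conditional distributions coincide, it approaches 0.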
 
 