Dealing with Missing Values - Data Preprocessing in Data Mining

Graphics Reference

In-Depth Information

4.5.3 K-means Clustering Imputation (KMI)

In K-means clustering [ 53 ], the intra-cluster dissimilarity is measured by the summa-

tion of distances between the objects and the centroid of the cluster they are assigned

to. A cluster centroid represents the mean value of the objects in the cluster. Given a

set of objects, the overall objective of clustering is to divide the data set into groups

based on the similarity of objects, and to minimize the intra-cluster dissimilarity.

KMI measures the intra-cluster dissimilarity by the addition of distances among the

objects and the centroid of the cluster which they are assigned to. A cluster centroid

represents the mean value of the objects in the cluster. Once the clusters have con-

verged, the last process is to fill in all the non-reference attributes for each incomplete

object based on the cluster information. Data objects that belong to the same cluster

are taken to be nearest neighbors of each other, and KMI applies a nearest neighbor

algorithm to replace MVs, in a similar way to KNNI.

Given a set of N objects X

x 1 ,

x 2 ,

ldots

x N where each object has S attributes,

we use x ij (

to denote the value of attribute j in object x i .

Object x i is called a complete object, if

≤

N and1

≤

)

{

x ij

= φ |∀

≤

}

, and an incomplete

object, if

, and we say object x i has a MV on attribute j .

For any incomplete object x i ,weuse R

{

x ij

= φ |∃

≤

}

to denote the

set of attributes whose values are available, and these attributes are called reference

attributes. Our objective is to obtain the values of non-reference attributes for the

incomplete objects. By K-means clustering method, we divide data set X into K

clusters, and each cluster is represented by the centroid of the set of objects in the

cluster. Let V

x ij

= φ,

≤

}

v 1 ,...,

v k be the set of K clusters, where v k (

≤

)

represents

the centroid of cluster k . Note that v k is also a vector in a S -dimensional space. We

use d

to denote the distance between centroid v k and object x i .

KMI can be divided into three processes. First, randomly select K complete data

objects as K centroids. Second, iteratively modify the partition to reduce the sum

of the distances for each object from the centroid of the cluster to which the object

belongs. The process terminates once the summation of distances is less than a user-

specified threshold

(

v k ,

x i )

100, or no change on the centroids weremade in last iteration.

The last process is to fill in all the non-reference attributes for each incomplete object

based on the cluster information. Data objects that belong to the same cluster are

taken as nearest neighbors of each other, and we apply a nearest neighbor algorithm

to replace missing data. We use as a distance measure the Euclidean distance.

ε =

4.5.4 Imputation with Fuzzy K-means Clustering (FKMI)

In fuzzy clustering, each data object has a membership function which describes the

degree to which this data object belongs to a certain cluster. Now we want to extend

the original K-means clustering method to a fuzzy version to impute missing data

[ 1 , 53 ]. The reason for applying the fuzzy approach is that fuzzy clustering provides

Data Preprocessing in Data Mining

Search WWH ::

Custom Search

Home