Graphics Reference
In-Depth Information
4.5.3 K-means Clustering Imputation (KMI)
In K-means clustering [ 53 ], the intra-cluster dissimilarity is measured by the summa-
tion of distances between the objects and the centroid of the cluster they are assigned
to. A cluster centroid represents the mean value of the objects in the cluster. Given a
set of objects, the overall objective of clustering is to divide the data set into groups
based on the similarity of objects, and to minimize the intra-cluster dissimilarity.
KMI measures the intra-cluster dissimilarity by the addition of distances among the
objects and the centroid of the cluster which they are assigned to. A cluster centroid
represents the mean value of the objects in the cluster. Once the clusters have con-
verged, the last process is to fill in all the non-reference attributes for each incomplete
object based on the cluster information. Data objects that belong to the same cluster
are taken to be nearest neighbors of each other, and KMI applies a nearest neighbor
algorithm to replace MVs, in a similar way to KNNI.
Given a set of N objects X
=
x 1 ,
x 2 ,
ldots
,
x N where each object has S attributes,
we use x ij (
to denote the value of attribute j in object x i .
Object x i is called a complete object, if
1
i
N and1
j
S
)
{
x ij
= φ |∀
1
j
S
}
, and an incomplete
object, if
, and we say object x i has a MV on attribute j .
For any incomplete object x i ,weuse R
{
x ij
= φ |∃
1
j
S
}
to denote the
set of attributes whose values are available, and these attributes are called reference
attributes. Our objective is to obtain the values of non-reference attributes for the
incomplete objects. By K-means clustering method, we divide data set X into K
clusters, and each cluster is represented by the centroid of the set of objects in the
cluster. Let V
={
j
|
x ij
= φ,
1
j
S
}
=
v 1 ,...,
v k be the set of K clusters, where v k (
)
represents
the centroid of cluster k . Note that v k is also a vector in a S -dimensional space. We
use d
1
k
K
to denote the distance between centroid v k and object x i .
KMI can be divided into three processes. First, randomly select K complete data
objects as K centroids. Second, iteratively modify the partition to reduce the sum
of the distances for each object from the centroid of the cluster to which the object
belongs. The process terminates once the summation of distances is less than a user-
specified threshold
(
v k ,
x i )
100, or no change on the centroids weremade in last iteration.
The last process is to fill in all the non-reference attributes for each incomplete object
based on the cluster information. Data objects that belong to the same cluster are
taken as nearest neighbors of each other, and we apply a nearest neighbor algorithm
to replace missing data. We use as a distance measure the Euclidean distance.
ε =
4.5.4 Imputation with Fuzzy K-means Clustering (FKMI)
In fuzzy clustering, each data object has a membership function which describes the
degree to which this data object belongs to a certain cluster. Now we want to extend
the original K-means clustering method to a fuzzy version to impute missing data
[ 1 , 53 ]. The reason for applying the fuzzy approach is that fuzzy clustering provides
 
Search WWH ::




Custom Search