Dealing with Missing Values - Data Preprocessing in Data Mining


a better description tool when the clusters are not well separated, as is the case in missing data imputation. Moreover, the original K-means clustering may become trapped in a local minimum if the initial points are not selected properly. The continuous membership values in fuzzy clustering, however, make the resulting algorithms less susceptible to getting stuck in a local minimum.

In fuzzy clustering, each data object $x_i$ has a membership function which describes the degree to which this data object belongs to a certain cluster $v_k$. The membership function is defined in the following equation:

$$
U(v_k, x_i) = \frac{d(v_k, x_i)^{-2/(m-1)}}{\sum_{j=1}^{K} d(v_j, x_i)^{-2/(m-1)}} \tag{4.29}
$$

where $m > 1$ is the fuzzifier, $d(v_k, x_i)$ is the distance between the centroid $v_k$ and the object $x_i$, and $\sum_{j=1}^{K} U(v_j, x_i) = 1$ for any data object $x_i$ ($1 \le i \le N$, with $N$ the number of data objects). Now we cannot simply compute the cluster centroids as mean values. Instead, we need to consider the membership degree of each data object. Equation (4.30) provides the formula for cluster centroid computation:

$$
v_k = \frac{\sum_{i=1}^{N} U(v_k, x_i) \times x_i}{\sum_{i=1}^{N} U(v_k, x_i)} \tag{4.30}
$$

Since incomplete objects contain unavailable values, we use only the reference attributes to compute the cluster centroids.
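The two update rules in Eqs. (4.29) and (4.30) can be sketched in plain Python as follows. This is only an illustration under assumptions that the text does not state: missing values are encoded as `None`, the distance $d$ is taken to be Euclidean and restricted to the attributes observed in $x_i$, and each centroid attribute is averaged only over the objects in which that attribute is observed. The function names `memberships` and `update_centroids` are hypothetical.

```python
import math


def memberships(x, centroids, m=2.0):
    """Eq. (4.29): membership degrees of object x in every cluster (they sum to 1)."""
    # Assumption: Euclidean distance over the attributes observed in x
    # (None marks a missing value); centroids are assumed to be complete.
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(x, v) if a is not None))
        for v in centroids
    ]
    # If x coincides with one or more centroids, share full membership
    # among them to avoid raising zero to a negative power.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = -2.0 / (m - 1.0)  # requires the fuzzifier m > 1
    weights = [d ** exponent for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def update_centroids(data, U):
    """Eq. (4.30): membership-weighted mean for each cluster and attribute.

    data[i] is object x_i; U[i][k] is the membership of x_i in cluster k.
    Assumption: every attribute is observed in at least one object that
    has a nonzero membership in each cluster.
    """
    n_attrs = len(data[0])
    n_clusters = len(U[0])
    centroids = []
    for k in range(n_clusters):
        v = []
        for j in range(n_attrs):
            # Only objects in which attribute j is observed contribute.
            num = sum(U[i][k] * x[j] for i, x in enumerate(data) if x[j] is not None)
            den = sum(U[i][k] for i, x in enumerate(data) if x[j] is not None)
            v.append(num / den)
        centroids.append(v)
    return centroids
```

Calling `memberships` for every object and then `update_centroids` once corresponds to a single pass of the iterative update; the stopping test against the distance threshold is left out of this sketch.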

The algorithm for missing data imputation with the fuzzy K-means clustering method also has three processes. In the initialization process, we pick K centroids that are evenly distributed in order to avoid local minima. In the second process, we iteratively update the membership functions and centroids until the overall distance meets the user-specified distance threshold. In this process, we cannot assign a data object to a concrete cluster represented by a cluster centroid (as was done in the basic K-means clustering algorithm), because each data object belongs to all K clusters with different membership degrees. Finally, we impute the non-reference attributes of each incomplete object. We replace the non-reference attributes of each incomplete data object $x_i$ based on the membership degrees and the values of the cluster centroids, as shown in the following equation:

$$
x_{i,j} = \sum_{k=1}^{K} U(v_k, x_i) \times v_{k,j}, \quad \text{for any non-reference attribute } j \tag{4.31}
$$
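The imputation step of Eq. (4.31) can be illustrated with a short Python sketch. It assumes, beyond what the text states, that missing values are encoded as `None` and that the membership degrees and centroids have already been obtained from the iterative update; the function name `impute` and the numbers in the usage lines are purely illustrative.

```python
def impute(x, u, centroids):
    """Eq. (4.31): fill each missing attribute of x with the
    membership-weighted combination of the centroid values.

    x         -- one data object; None marks a non-reference (missing) attribute
    u         -- membership degrees of x in each of the K clusters (sum to 1)
    centroids -- K complete centroid vectors
    """
    filled = list(x)
    for j, value in enumerate(x):
        if value is None:  # only non-reference attributes are replaced
            filled[j] = sum(u_k * v[j] for u_k, v in zip(u, centroids))
    return filled


# Illustrative values only: one object with its second attribute missing.
x = [1.0, None]
u = [0.9, 0.1]
centroids = [[1.0, 10.0], [3.0, 20.0]]
completed = impute(x, u, centroids)
```

Because every cluster contributes in proportion to its membership degree, the imputed value lies between the smallest and largest centroid values of that attribute, which reflects the soft assignment described above.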

4.5.5 Support Vector Machines Imputation (SVMI)

Support Vector Machines Imputation [29] is an SVM regression-based algorithm to fill in MVs, i.e. set the decision attributes (output or classes) as the condition
