Dealing with Missing Values - Data Preprocessing in Data Mining


where df is the corresponding degree of freedom, having the value $(|E_c| - 1)(|E_k| - 1)$.

A chi-squared test is then used to select the interdependent variables in X at a presumed significance level.
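The selection step can be illustrated with a small sketch. It computes a chi-squared statistic and the degrees of freedom $(|E_c| - 1)(|E_k| - 1)$ for one attribute paired with the cluster labels. The excerpt does not show the exact test statistic, so the Pearson form used here is an assumption, as are the function name `chi_squared_dependence` and the toy data. The critical value 3.841 is the standard 0.05-level value for one degree of freedom.

```python
from collections import Counter

def chi_squared_dependence(xs, cs):
    """Return (statistic, df) for paired attribute values xs and cluster labels cs.

    Assumption: a Pearson chi-squared statistic on the observed contingency table.
    """
    n = len(xs)
    joint = Counter(zip(xs, cs))   # observed cell counts
    x_counts = Counter(xs)         # marginal counts of the attribute values (E_k)
    c_counts = Counter(cs)         # marginal counts of the cluster labels (E_c)
    stat = 0.0
    for x, nx in x_counts.items():
        for c, nc in c_counts.items():
            expected = nx * nc / n
            stat += (joint[(x, c)] - expected) ** 2 / expected
    df = (len(c_counts) - 1) * (len(x_counts) - 1)
    return stat, df

# Toy data (hypothetical): the attribute value fully determines the label.
xs = ["u"] * 10 + ["v"] * 10
cs = ["a"] * 10 + ["b"] * 10
stat, df = chi_squared_dependence(xs, cs)
CRITICAL_0_05_DF1 = 3.841  # chi-squared critical value, df = 1, level 0.05
is_interdependent = df == 1 and stat > CRITICAL_0_05_DF1
```

An attribute whose statistic exceeds the critical value for its df would be kept as interdependent with the cluster label. For other df values the critical value has to be looked up or computed separately.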

The cluster regrouping process uses an information measure to regroup data iteratively. Wong et al. have proposed an information measure called normalized surprisal (NS) to indicate the significance of joint information. Using this measure, the information conditioned by an observed event $x_k$ is weighted according to $R(X_k, C_k)$, its measure of interdependency with the cluster label variable. Therefore, the higher the interdependency of a conditioning event, the more relevant the event is. NS measures the joint information of a hypothesized value based on the selected set of significant components. It is defined as

$$NS(a_{cj} \mid x'(a_{cj})) = \frac{I(a_{cj} \mid x'(a_{cj}))}{m' \sum_{k=1}^{m'} R(X_k, C_k)} \tag{4.58}$$

where $I(a_{cj} \mid x'(a_{cj}))$ is the summation of the weighted conditional information defined on the incomplete probability distribution scheme as

$$I(a_{cj} \mid x'(a_{cj})) = \sum_{k=1}^{m'} R(X_k, C_k)\, I(a_{cj} \mid x_k) = \sum_{k=1}^{m'} R(X_k, C_k) \left( -\log \frac{P(a_{cj} \mid x_k)}{\sum_{a_{cu} \in E_c} P(a_{cu} \mid x_k)} \right) \tag{4.59}$$

In rendering a meaningful calculation in the incomplete probability scheme formulation, $x_k$ is selected if

$$\sum_{a_{cu} \in E_c} P(a_{cu} \mid x_k) > T \tag{4.60}$$

where $T \ge 0$ is a size threshold for meaningful estimation. NS can be used in a decision rule in the regrouping process. Let $C = \{a_{c1}, \ldots, a_{cq}\}$ be the set of possible cluster labels. We assign $a_{cj}$ to $x_e$ if

$$NS(a_{cj} \mid x'(a_{cj})) = \min_{a_{cu} \in C} NS(a_{cu} \mid x'(a_{cu})).$$
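To make Eqs. (4.58) to (4.60) concrete, the following is a minimal sketch of one way to compute NS for a hypothesized label. It assumes the conditional probabilities and the interdependency weights have already been estimated. The function name, the argument layout (`cond_probs`, `weights`), the treatment of a zero probability as infinite surprisal, and the normalization by the number of selected components times the sum of their weights are this sketch's reading of the formulas, not a confirmed implementation.

```python
import math

def normalized_surprisal(label, cond_probs, weights, threshold=0.0):
    """Return NS for a hypothesized cluster label, or None if no x_k is selected.

    cond_probs[k] maps each cluster label a_cu to an estimate of P(a_cu | x_k).
    weights[k] is the interdependency weight R(X_k, C_k) of component k.
    """
    selected = []  # (normalized probability of `label`, weight) per kept x_k
    for probs, r in zip(cond_probs, weights):
        mass = sum(probs.values())
        # Eq. (4.60): keep x_k only if its probability mass exceeds T.
        # Assumption: components with nonpositive weight are ignored.
        if mass > threshold and r > 0.0:
            selected.append((probs.get(label, 0.0) / mass, r))
    if not selected:
        return None
    # Eq. (4.59): weighted sum of conditional information -log(p).
    # Assumption: a zero probability is treated as infinite surprisal.
    info = sum(r * (-math.log(p) if p > 0.0 else math.inf) for p, r in selected)
    # Eq. (4.58): normalize by m' times the sum of the selected weights.
    return info / (len(selected) * sum(r for _, r in selected))

# Hypothetical estimates for two observed events and two labels.
cond_probs = [{"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75}]
weights = [1.0, 1.0]
ns_a = normalized_surprisal("a", cond_probs, weights)
ns_b = normalized_surprisal("b", cond_probs, weights)
```

With these toy numbers, label `b` has the smaller NS, so the decision rule above would prefer it.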

If no component is selected with respect to all hypothesized cluster labels, or if there is more than one label associated with the same minimum NS, then the sample is assigned a dummy label, indicating that the estimated cluster label is still uncertain. Also, because zero probabilities may be encountered in the probability estimation, an unbiased probability estimate based on entropy minimax is used. In the regrouping algorithm, the cluster label for each sample is estimated iteratively until a stable set of label assignments is attained.
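The decision rule and the iteration described above can be sketched as follows. This is an illustration under stated assumptions: `DUMMY`, `assign_label`, `regroup`, and the `score_fn` callback (which stands in for re-estimating NS per label from the current assignments) are hypothetical names, and the tie tolerance and iteration cap are arbitrary choices.

```python
DUMMY = "?"  # hypothetical marker for an uncertain cluster label

def assign_label(ns_by_label, tol=1e-12):
    """Pick the label with minimum NS; return DUMMY on ties or if nothing is scored.

    ns_by_label maps each hypothesized label to its NS, or to None when no
    component was selected for that label.
    """
    scored = {lab: ns for lab, ns in ns_by_label.items() if ns is not None}
    if not scored:
        return DUMMY
    best = min(scored.values())
    winners = [lab for lab, ns in scored.items() if abs(ns - best) <= tol]
    return winners[0] if len(winners) == 1 else DUMMY

def regroup(samples, labels, score_fn, max_iter=100):
    """Re-estimate every sample's label until the assignments stop changing.

    score_fn(sample, labels) must return a dict label -> NS (or None); in a full
    implementation it would re-estimate probabilities from the current labels.
    """
    labels = list(labels)
    for _ in range(max_iter):
        new_labels = [assign_label(score_fn(x, labels)) for x in samples]
        if new_labels == labels:
            break  # stable set of label assignments reached
        labels = new_labels
    return labels

# Toy scoring function (hypothetical): ignores the current labels entirely.
toy_scores = lambda x, labels: {"a": x, "b": 1 - x}
result = regroup([0.1, 0.9, 0.5], [DUMMY, DUMMY, DUMMY], toy_scores)
```

In the toy run, the third sample scores equally for both labels, so it keeps the dummy label while the other two receive definite assignments.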
