expressed as:

$$P(\mathbf{x}) = \prod_{j=1}^{n} P(x_{m_j} \mid x_{m_{k(j)}}), \qquad 0 \le k(j) < j \tag{4.48}$$
where (1) the index set $m_1, m_2, \ldots, m_n$ is a permutation of the integer set $1, 2, \ldots, n$; (2) the ordered pairs $(x_{m_j}, x_{m_{k(j)}})$ are chosen so that they form the set of branches of a spanning tree defined on $X$ with their summed MI maximized; and (3) $P(x_{m_1} \mid x_{m_0}) = P(x_{m_1})$.
The probability defined above is known to be the best second-order approximation of the high-order probability distribution. Then, corresponding to each $\mathbf{x}$ in the ensemble, a probability $P(\mathbf{x})$ can be estimated.
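The construction above can be sketched in code. This is a minimal illustration, not the implementation from [13]: it assumes discrete-valued sample tuples, uses empirical frequencies for the probabilities, and uses Prim's algorithm to grow the maximum-MI spanning tree; all function names are illustrative.

```python
from collections import Counter
from math import log

def mutual_information(data, i, j):
    """Empirical mutual information between variables i and j of the sample tuples."""
    n = len(data)
    pij = Counter((x[i], x[j]) for x in data)
    pi = Counter(x[i] for x in data)
    pj = Counter(x[j] for x in data)
    return sum((c / n) * log((c * n) / (pi[a] * pj[b]))
               for (a, b), c in pij.items())

def dependence_tree(data, n_vars):
    """Grow a maximum-MI spanning tree (Prim's algorithm).
    Returns parent[j] = k(j); the root has parent None, i.e. P(x_m1 | x_m0) = P(x_m1)."""
    in_tree = {0}
    parent = {0: None}
    while len(in_tree) < n_vars:
        u, v, _ = max(((u, v, mutual_information(data, u, v))
                       for u in in_tree for v in range(n_vars) if v not in in_tree),
                      key=lambda t: t[2])
        parent[v] = u
        in_tree.add(v)
    return parent

def tree_probability(data, x, parent):
    """Second-order product approximation P(x) of Eq. (4.48) over the tree branches."""
    n = len(data)
    p = 1.0
    for j, k in parent.items():
        if k is None:
            p *= sum(1 for s in data if s[j] == x[j]) / n
        else:
            joint = sum(1 for s in data if s[j] == x[j] and s[k] == x[k])
            cond = sum(1 for s in data if s[k] == x[k])
            p *= joint / cond if cond else 0.0
    return p
```

Each sample's probability is then the product of one first-order term (the root) and $n-1$ second-order conditionals, which is what makes the approximation second order.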
In general, it is more likely for samples of relatively high probability to form
clusters. By introducing the mean probability below, samples can be divided into
two subsets: those above the mean and those below. Samples above the mean will
be considered first for cluster initiation.
Let $S = \{\mathbf{x}\}$ be the set of samples. The mean probability is defined as

$$\mu_s = \sum_{\mathbf{x} \in S} P(\mathbf{x}) \,/\, |S| \tag{4.49}$$

where $|S|$ is the number of samples in $S$. For more details on probability estimation with the dependence tree product approximation, please refer to [13].
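The mean-probability split of Eq. (4.49) can be sketched as follows; `split_by_mean_probability` is an illustrative name, and ties at the mean are placed in the "above" subset as an assumption:

```python
def split_by_mean_probability(samples, prob):
    """Divide samples into those at/above and those below the mean probability mu_s."""
    mu = sum(prob(x) for x in samples) / len(samples)   # Eq. (4.49)
    above = [x for x in samples if prob(x) >= mu]       # considered first for cluster initiation
    below = [x for x in samples if prob(x) < mu]
    return mu, above, below
```

The `above` subset is the pool from which cluster initiation starts, as described in the text.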
When distance is considered for cluster initiation, we can use the following criteria in assigning a sample $\mathbf{x}$ to a cluster, given a distance threshold $D$:

1. If there exists more than one cluster, say $\{C_k \mid k = 1, 2, \ldots\}$, such that $D(\mathbf{x}, C_k) \le D$ for all such $k$, then all these clusters can be merged together.
2. If exactly one cluster $C_k$ exists such that $D(\mathbf{x}, C_k) \le D$, then $\mathbf{x}$ can be grouped into $C_k$.
3. If $D(\mathbf{x}, C_k) > D$ for all clusters $C_k$, then $\mathbf{x}$ may not belong to any cluster.
To avoid including outliers in the distance calculation, we use a simple method suggested in [99], which assigns $D$ the maximum value of all nearest-neighbor distances in $L$, provided there is a sample in $L$ having a nearest-neighbor distance value of $D(\mathbf{x}, C_k) > 1$ (with the distance values rounded to the nearest integer value).
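The three assignment criteria can be sketched as below. This is not the procedure from the source verbatim: the sample-to-cluster distance is taken to be single-link (minimum over members) as an assumption, and criterion 3 is realized by letting the sample start its own cluster, which is one common choice.

```python
def cluster_distance(x, cluster, d):
    """Single-link distance from sample x to a cluster (an illustrative choice)."""
    return min(d(x, y) for y in cluster)

def assign_sample(x, clusters, d, D):
    """Apply the three criteria with threshold D.
    clusters is a list of lists of samples; d is a point-to-point distance."""
    near = [c for c in clusters if cluster_distance(x, c, d) <= D]
    rest = [c for c in clusters if c not in near]
    if len(near) > 1:                       # criterion 1: merge all nearby clusters
        return rest + [[y for c in near for y in c] + [x]]
    if len(near) == 1:                      # criterion 2: group x into the single C_k
        return rest + [near[0] + [x]]
    return clusters + [[x]]                 # criterion 3: no cluster fits; x stands alone
```

With a threshold chosen from nearest-neighbor distances as in [99], repeated calls to `assign_sample` over the high-probability samples yield the initial clusters.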
After finding the initial clusters along with their membership, the regrouping process is essentially an inference process for estimating the cluster label of a sample. Let $C = \{a_{c1}, a_{c2}, \ldots, a_{cq}\}$ be the set of labels for all possible clusters to which $\mathbf{x}$ can be assigned. For $X_k$ in $X$, we can form a contingency table between $X_k$ and $C$. Let $a_{ks}$ and $a_{cj}$ be possible outcomes of $X_k$ and $C$ respectively, and let $\mathrm{obs}(a_{ks})$ and $\mathrm{obs}(a_{cj})$ be the respective marginal frequencies of their observed occurrences.
The expected relative frequency of $(a_{ks}, a_{cj})$ is expressed as:

$$\exp(a_{ks}, a_{cj}) = \frac{\mathrm{obs}(a_{ks}) \times \mathrm{obs}(a_{cj})}{|S|} \tag{4.50}$$
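Eq. (4.50) is the familiar independence-based expectation from a contingency table, computed from the two marginals. A minimal sketch, assuming the observations arrive as (outcome, label) pairs and using an illustrative function name:

```python
from collections import Counter

def expected_relative_frequency(pairs):
    """Compute exp(a_ks, a_cj) = obs(a_ks) * obs(a_cj) / |S| for every outcome/label pair.
    pairs: list of (outcome of X_k, cluster label) observations, one per sample in S."""
    n = len(pairs)
    obs_x = Counter(a for a, _ in pairs)    # marginal frequencies obs(a_ks)
    obs_c = Counter(b for _, b in pairs)    # marginal frequencies obs(a_cj)
    return {(a, b): obs_x[a] * obs_c[b] / n for a in obs_x for b in obs_c}
```

Comparing these expected values against the observed joint frequencies is what drives the label inference during regrouping.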