Advanced Cluster Analysis - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

, C k , through cluster analysis.

For a data set, D , of n objects, we can regard D as a finite sample of the possible instances

of the clusters. Conceptually, we can assume that D is formed as follows. Each cluster,

C j .

Suppose we want to find k probabilistic clusters, C 1 ,

:::

1 j k

, is associated with a probability,

! j , that some instance is sampled from

the cluster. It is often assumed that

! 1 ,

:::

! k are given as part of the problem setting,

and that P j D1 ! j D 1, which ensures that all objects are generated by the k clusters.

Here, parameter

! j captures background knowledge about the relative population of

cluster C j .

We then run the following two steps to generate an object in D . The steps are executed

n times in total to generate n objects, o 1 ,

:::

, o n , in D .

1. Choose a cluster, C j , according to probabilities

! k .

2. Choose an instance of C j according to its probability density function, f j .

! 1 ,

:::

The data generation process here is the basic assumption in mixture models. Formally,

a mixture model assumes that a set of observed objects is a mixture of instances from

multiple probabilistic clusters. Conceptually, each observed object is generated indepen-

dently by two steps: first choosing a probabilistic cluster according to the probabilities of

the clusters, and then choosing a sample according to the probability density function

of the chosen cluster.

Given data set, D , and k , the number of clusters required, the task of probabilistic

model-based cluster analysis is to infer a set of k probabilistic clusters that is most likely to

generate D using this data generation process. An important question remaining is how

we can measure the likelihood that a set of k probabilistic clusters and their probabilities

will generate an observed data set.

Consider a set, C , of k probabilistic clusters, C 1 ,

:::

, C k , with probability density

functions f 1 ,

:::

, f k , respectively, and their probabilities,

! 1 ,

:::

! k . For an object, o , the

probability that o is generated by cluster C j .

1 j k

is given by P

o j C j /D! j f j .

Therefore, the probability that o is generated by the set C of clusters is

k X

j D1 ! j f j .

o j C

(11.5)

Since the objects are assumed to have been generated independently, for a data set, D D

f o 1 ,

:::

, o n g, of n objects, we have

n Y

k X

j D1 ! j f j .

D j C

o i j C

o i /

(11.6)

i D1

Now, it is clear that the task of probabilistic model-based cluster analysis on a data

set, D , is to find a set C of k probabilistic clusters such that P

D j C

is maximized. Maxi-

mizing P

D j C

is often intractable because, in general, the probability density function

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home