Databases Reference
In-Depth Information
, C k , through cluster analysis.
For a data set, D , of n objects, we can regard D as a finite sample of the possible instances
of the clusters. Conceptually, we can assume that D is formed as follows. Each cluster,
C j .
Suppose we want to find k probabilistic clusters, C 1 ,
:::
1 j k
/
, is associated with a probability,
! j , that some instance is sampled from
the cluster. It is often assumed that
! 1 ,
:::
,
! k are given as part of the problem setting,
and that P j D1 ! j D 1, which ensures that all objects are generated by the k clusters.
Here, parameter
! j captures background knowledge about the relative population of
cluster C j .
We then run the following two steps to generate an object in D . The steps are executed
n times in total to generate n objects, o 1 ,
:::
, o n , in D .
1. Choose a cluster, C j , according to probabilities
! k .
2. Choose an instance of C j according to its probability density function, f j .
! 1 ,
:::
,
The data generation process here is the basic assumption in mixture models. Formally,
a mixture model assumes that a set of observed objects is a mixture of instances from
multiple probabilistic clusters. Conceptually, each observed object is generated indepen-
dently by two steps: first choosing a probabilistic cluster according to the probabilities of
the clusters, and then choosing a sample according to the probability density function
of the chosen cluster.
Given data set, D , and k , the number of clusters required, the task of probabilistic
model-based cluster analysis is to infer a set of k probabilistic clusters that is most likely to
generate D using this data generation process. An important question remaining is how
we can measure the likelihood that a set of k probabilistic clusters and their probabilities
will generate an observed data set.
Consider a set, C , of k probabilistic clusters, C 1 ,
:::
, C k , with probability density
functions f 1 ,
:::
, f k , respectively, and their probabilities,
! 1 ,
:::
,
! k . For an object, o , the
probability that o is generated by cluster C j .
1 j k
/
is given by P
.
o j C j /D! j f j .
o
/
.
Therefore, the probability that o is generated by the set C of clusters is
k X
j D1 ! j f j .
P
.
o j C
/D
o
/
.
(11.5)
Since the objects are assumed to have been generated independently, for a data set, D D
f o 1 ,
:::
, o n g, of n objects, we have
n Y
n Y
k X
j D1 ! j f j .
P
.
D j C
/D
P
.
o i j C
/D
o i /
.
(11.6)
i D1
i D1
Now, it is clear that the task of probabilistic model-based cluster analysis on a data
set, D , is to find a set C of k probabilistic clusters such that P
.
D j C
/
is maximized. Maxi-
mizing P
.
D j C
/
is often intractable because, in general, the probability density function
 
Search WWH ::




Custom Search