Databases Reference
In-Depth Information
,
C
k
, through cluster analysis.
For a data set,
D
, of
n
objects, we can regard
D
as a finite sample of the possible instances
of the clusters. Conceptually, we can assume that
D
is formed as follows. Each cluster,
C
j
.
Suppose we want to find
k
probabilistic clusters,
C
1
,
:::
1
j
k
/
, is associated with a probability,
!
j
, that some instance is sampled from
the cluster. It is often assumed that
!
1
,
:::
,
!
k
are given as part of the problem setting,
and that
P
j
D1
!
j
D 1, which ensures that all objects are generated by the
k
clusters.
Here, parameter
!
j
captures background knowledge about the relative population of
cluster
C
j
.
We then run the following two steps to generate an object in
D
. The steps are executed
n
times in total to generate
n
objects,
o
1
,
:::
,
o
n
, in
D
.
1.
Choose a cluster,
C
j
, according to probabilities
!
k
.
2.
Choose an instance of
C
j
according to its probability density function,
f
j
.
!
1
,
:::
,
The data generation process here is the basic assumption in mixture models. Formally,
a
mixture model
assumes that a set of observed objects is a mixture of instances from
multiple probabilistic clusters. Conceptually, each observed object is generated indepen-
dently by two steps: first choosing a probabilistic cluster according to the probabilities of
the clusters, and then choosing a sample according to the probability density function
of the chosen cluster.
Given data set,
D
, and
k
, the number of clusters required, the task of
probabilistic
model-based cluster analysis
is to infer a set of
k
probabilistic clusters that is most likely to
generate
D
using this data generation process. An important question remaining is how
we can measure the likelihood that a set of
k
probabilistic clusters and their probabilities
will generate an observed data set.
Consider a set,
C
, of
k
probabilistic clusters,
C
1
,
:::
,
C
k
, with probability density
functions
f
1
,
:::
,
f
k
, respectively, and their probabilities,
!
1
,
:::
,
!
k
. For an object,
o
, the
probability that
o
is generated by cluster
C
j
.
1
j
k
/
is given by
P
.
o
j
C
j
/D!
j
f
j
.
o
/
.
Therefore, the probability that
o
is generated by the set
C
of clusters is
k
X
j
D1
!
j
f
j
.
P
.
o
j
C
/D
o
/
.
(11.5)
Since the objects are assumed to have been generated independently, for a data set,
D
D
f
o
1
,
:::
,
o
n
g, of
n
objects, we have
n
Y
n
Y
k
X
j
D1
!
j
f
j
.
P
.
D
j
C
/D
P
.
o
i
j
C
/D
o
i
/
.
(11.6)
i
D1
i
D1
Now, it is clear that the task of probabilistic model-based cluster analysis on a data
set,
D
, is to find a set
C
of
k
probabilistic clusters such that
P
.
D
j
C
/
is maximized. Maxi-
mizing
P
.
D
j
C
/
is often intractable because, in general, the probability density function