Algorithm 7 spkmeans
Require: Set X of data points on S^{d-1}
Ensure: A disjoint k-partitioning {X_h}_{h=1}^{k} of X
Initialize μ_h, h = 1, ..., k
repeat
  {The E (Expectation) step of EM}
  Set X_h ← ∅, h = 1, ..., k
  for i = 1 to n do
    X_h ← X_h ∪ {x_i}, where h = argmax_{h'} x_i^T μ_{h'}
  {The M (Maximization) step of EM}
  for h = 1 to k do
    μ_h ← (Σ_{x ∈ X_h} x) / ‖Σ_{x ∈ X_h} x‖
until convergence.
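The steps of Algorithm 7 can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the choice to pass the initial means as an argument, the stopping rule (assignments no longer change), and the handling of empty clusters are all assumptions, since the pseudocode leaves them unspecified.

```python
import math


def normalize(v):
    """Scale v to unit L2 norm; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)


def dot(a, b):
    """Inner product; equals cosine similarity for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))


def spkmeans(points, init_means, max_iter=100):
    """Spherical k-means sketch.

    points     -- list of unit-norm vectors (lists of floats)
    init_means -- list of k unit-norm starting means (hypothetical parameter;
                  Algorithm 7 only says "Initialize mu_h")
    Returns (labels, means).
    """
    means = [list(m) for m in init_means]
    k = len(means)
    labels = None
    for _ in range(max_iter):
        # E step: assign each point to the mean with the largest x^T mu_h.
        new_labels = [max(range(k), key=lambda h: dot(x, means[h])) for x in points]
        # Assumption: treat unchanged assignments as convergence.
        if new_labels == labels:
            break
        labels = new_labels
        # M step: each mean becomes the normalized sum of its cluster's points.
        for h in range(k):
            members = [x for x, lab in zip(points, labels) if lab == h]
            if members:  # assumption: an empty cluster keeps its previous mean
                total = [sum(col) for col in zip(*members)]
                means[h] = normalize(total)
    return labels, means
```

Because every point and every mean has unit norm, maximizing the inner product in the E step is the same as maximizing cosine similarity, and the M step only needs a sum followed by a renormalization.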
4. moVMF based clustering using soft assignments: soft-moVMF.
It has already been established that kmeans using Euclidean distance performs
much worse than spkmeans for text data (49), so we do not consider it here.
Generative model based algorithms that use mixtures of Bernoulli or multinomial
distributions, which have been shown to perform well for text datasets, have
also not been included in the experiments. We exclude them because a recent
empirical study over 15 text datasets showed that simple versions of vMF
mixture models (with κ constant for all clusters) outperform the multinomial
model on all but one dataset (Classic3), and that the Bernoulli model was
inferior on all datasets (56). Further, for certain datasets, we
compare clustering performance with latent Dirichlet allocation (LDA) (12)
and exponential family approximation of Dirichlet compounded multinomial
(EDCM) models (23).
6.7.1 Datasets
The datasets that we used for empirical validation and comparison of our
algorithms were carefully selected to represent some typical clustering
problems. We also created various subsets of some of the datasets for gaining
greater insight into the nature of clusters discovered or to model some
particular clustering scenario (e.g., balanced clusters, skewed clusters,
overlapping clusters, etc.). We drew our data from five sources: Simulated,
Classic3, Yahoo News, 20 Newsgroups, and Slashdot. For all the text document
datasets, the toolkit MC (17) was used for creating a high-dimensional vector
space model that each of the four algorithms utilized. Matlab code was used to
render the input as a vector space for the simulated datasets.
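To make the vector space representation concrete, the sketch below maps raw documents to unit-norm term-frequency vectors, so that each document lies on the unit sphere as Algorithm 7 requires. It is only an illustration: the function name, the whitespace tokenization, and the plain term-frequency weighting are assumptions, and it does not reproduce the preprocessing or term weighting performed by the MC toolkit.

```python
import math
from collections import Counter


def unit_tf_vectors(documents):
    """Map documents to L2-normalized term-frequency vectors.

    Assumptions: lowercase whitespace tokenization and raw term counts;
    the toolkit used in the experiments may weight and filter terms differently.
    Returns (vocab, vectors), where vectors[i][j] is the normalized count of
    vocab[j] in documents[i].
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocab = sorted(set().union(*counts))
    vectors = []
    for c in counts:
        v = [float(c[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in v))
        # An empty document has no direction; leave its zero vector as-is.
        vectors.append([x / norm for x in v] if norm > 0 else v)
    return vocab, vectors
```

After this normalization, the inner product of two document vectors is their cosine similarity, which is the quantity spkmeans maximizes in its assignment step.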
 