question depends more on the objective of the cluster analysis than on any
predefined cluster definition.
Algorithms for Cluster Analysis
One of the earliest and simplest clustering algorithms, and one still widely used, is K-means. It identifies clusters based on proximity, using the concept of a centroid, which is defined as the mean of a group of points. In a dataset defined in n dimensions, that is, with n attributes or columns, each centroid is assigned a value in each of the n dimensions. Before beginning a cluster analysis with K-means, the analyst must first choose K, the number of expected clusters.
The steps of the algorithm are:
1. Randomly locate K initial centroids within the n-dimensional space. (Alternatively, randomly choose K observations from the dataset to serve as the initial centroids.)
2. Repeat until observation assignments to centroids no longer change:
a. Assign each observation in the dataset to its nearest centroid.
b. Recompute each centroid's location as the mean of all observations assigned to it.
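The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the text's reference implementation: it assumes squared Euclidean distance as the proximity measure and uses the alternative form of step 1, sampling K observations as the initial centroids.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means over a list of equal-length tuples (one per observation)."""
    rng = random.Random(seed)
    # Step 1 (alternative form): choose K observations as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2a: assign each observation to its nearest centroid
        # (squared Euclidean distance).
        new_assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Stop once assignments to centroids no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 2b: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments

# Two well-separated groups; with K = 2 the algorithm recovers them.
data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, labels = kmeans(data, k=2)
```

The guard for emptied clusters is one common convention; real implementations differ in how they handle that case and in how they pick initial centroids.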
Issues with the K-Means Clustering Process
Although K-means is simple to understand and implement, it does have shortcomings:
- K, the number of clusters, must be set before initiating the process.
- K-means generates a complete partitioning of the observations. There is no option to exclude observations from the clustering.
- When the initial centroids are randomly located, the resulting clusterings may vary from execution to execution. The end result is not deterministic.
- K-means does not handle datasets containing clusters of varying size well. In general, it will tend to split the larger clusters and may merge smaller clusters.
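The non-determinism noted above can be made concrete with a small sketch (the helper names here are hypothetical, not from the text): the same four points converge to two different stable partitions depending on where the initial centroids are placed.

```python
def nearest(points, centroids):
    # Index of the nearest centroid for each point (squared distance).
    return [min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(pt, centroids[c])))
            for pt in points]

def kmeans_from(points, centroids, iters=10):
    # K-means started from explicit (not random) initial centroids;
    # assumes no cluster empties along the way.
    for _ in range(iters):
        labels = nearest(points, centroids)
        centroids = [tuple(sum(dim) / labels.count(c)
                           for dim in zip(*[pt for pt, a in zip(points, labels)
                                            if a == c]))
                     for c in range(len(centroids))]
    return nearest(points, centroids)

# Four corners of a wide rectangle, K = 2.
corners = [(0, 0), (0, 1), (4, 0), (4, 1)]
split_by_x = kmeans_from(corners, [(0, 0.5), (4, 0.5)])  # left vs. right
split_by_y = kmeans_from(corners, [(2, 0), (2, 1)])      # bottom vs. top
# Both runs converge, yet they partition the same data differently.
```

Each starting position is a fixed point of the update rule, so neither run can escape its initial split; a random initialization would land on one or the other unpredictably.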
 