The algorithm integrates an internal outlier handling option for identifying and specially treating outliers. Another approach is to run an initial clustering solution, identify the small outlier clusters, and then rerun and fine-tune the analysis after excluding them.
CLUSTERING WITH K-MEANS
K-means is one of the most popular clustering algorithms. It starts with an initial
cluster solution which is updated and adjusted until no further refinement is
possible (or until the iterations exceed a specified number). Each iteration refines
the solution by reducing the within-cluster variation.
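To make "within-cluster variation" concrete, the quantity typically minimized is the sum of squared Euclidean distances between each record and the center of its cluster. The following minimal Python/NumPy sketch computes it; the function name and arguments are illustrative, not part of any particular tool.

import numpy as np

def within_cluster_variation(X, labels, centers):
    # Sum of squared Euclidean distances of each record
    # from the center of the cluster it belongs to
    return sum(
        np.sum((X[labels == k] - centers[k]) ** 2)
        for k in range(len(centers))
    )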
The "K" in the algorithm's name comes from the fact that the user must specify in advance the number of clusters, k, to be formed. The "means" part of the name refers to the fact that each cluster is represented by the means of its records on the clustering fields, a point referred to as the cluster centroid or cluster center.
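As a small illustration of the "means" idea, the centroid is simply the field-wise mean of the records assigned to a cluster (the values below are made up for illustration):

import numpy as np

# Records of one cluster on two clustering fields (toy values)
members = np.array([[1.0, 2.0],
                    [3.0, 4.0],
                    [2.0, 3.0]])
centroid = members.mean(axis=0)  # cluster center: [2.0, 3.0]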
More specifically, unlike clustering techniques such as TwoStep, which determine the number of underlying clusters automatically, K-means requires the user to specify in advance the number of clusters to be formed (k clusters). This model therefore calls for a fair amount of trial and error and experimental modeling, and analysts can use automatic clustering models to get a first indication of the number of clusters. Unavoidably, the quest for the optimal solution involves preliminary analysis, multiple runs, and evaluation of different solutions.
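In practice this trial-and-error process can be organized as a loop over candidate values of k, comparing the resulting solutions on a quality measure. The sketch below uses scikit-learn's KMeans together with the silhouette score; the random placeholder data and the range of k values tried are assumptions for illustration only.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(500, 4)  # placeholder: standardized clustering fields

for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    print(f"k={k}: within-cluster SS={model.inertia_:.1f}, "
          f"silhouette={silhouette_score(X, labels):.3f}")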
K-means uses the Euclidean distance measure and iteratively assigns each record to one of the derived clusters. It is a very fast algorithm since it does not need to calculate the distances between all pairs of records. Clusters are refined through an iterative procedure during which records move between the clusters until the procedure becomes stable. The procedure starts by selecting k well-spaced records as initial cluster centers (initial seeds) and assigns each record to its "nearest" cluster. As records are added to a cluster, its center is recalculated to reflect its new members. Cases are then reassigned to the adjusted clusters. This iterative procedure is repeated until it converges and the migration of records between clusters no longer refines the solution.
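The iterative procedure just described can be sketched in a few lines of Python/NumPy. This is a bare-bones illustration rather than a production implementation: it draws random records as initial seeds instead of deliberately well-spaced ones, and it omits refinements such as multiple restarts.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initial seeds: k distinct records chosen at random
    # (a simple stand-in for well-spaced seed selection)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assign every record to its "nearest" cluster (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no record changed cluster: the solution is stable
        labels = new_labels
        # Recalculate each center as the mean of its current members
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers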
The K-means procedure is illustrated graphically in Figure 3.5. For easier interpretation, the figure refers to two input clustering fields and hence a two-dimensional clustering solution.
One drawback of this algorithm is that, to some extent, it depends on the initially selected cluster centers and on the order of the input data. It is therefore recommended that the data be reordered randomly before model training. An even better approach