question depends more on the objective of the cluster analysis than on any
predefined cluster definition.
Algorithms for Cluster Analysis
One of the earliest and simplest clustering algorithms, and one still widely used, is K-means. It identifies clusters based on proximity, using the concept of a centroid, which is defined as the mean of a group of points. In a dataset defined in n dimensions, that is, with n attributes or columns, each centroid is assigned a value in each of the n dimensions. Before beginning a cluster analysis with K-means, the analyst must first choose K, the number of expected clusters.
The steps of the algorithm are:
1. Randomly locate K initial centroids within the n-dimensional space. (Alternatively, randomly choose K observations from the dataset to serve as the initial centroids.)
2. Repeat until observation assignments to centroids no longer change:
a. Assign each observation in the dataset to its nearest centroid.
b. Recompute each centroid's location as the mean of all observations assigned to it.
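The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the text's reference implementation: it assumes squared Euclidean distance as the proximity measure and uses the alternative form of step 1, sampling K observations as the initial centroids.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means over a list of equal-length tuples (one per observation)."""
    rng = random.Random(seed)
    # Step 1 (alternative form): choose K observations as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2a: assign each observation to its nearest centroid
        # (squared Euclidean distance).
        new_assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Stop once assignments to centroids no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 2b: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignments

# Two well-separated groups; with K = 2 the algorithm recovers them.
data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, labels = kmeans(data, k=2)
```

The guard for emptied clusters is one common convention; real implementations differ in how they handle that case and in how they pick initial centroids.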
Issues with the K-Means Clustering Process
Although K-means is simple to understand and implement, it does have shortcomings:
- K, the number of clusters, must be set before initiating the process.
- K-means generates a complete partitioning of the observations. There is no option to exclude observations from the clustering.
- When the initial centroids are randomly located, the resulting clusterings may vary from execution to execution. The end result is not deterministic.
- K-means does not handle datasets containing clusters of varying size well. In general, it will tend to split the larger clusters and may merge smaller clusters.
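The non-determinism noted above can be made concrete with a small sketch (the helper names here are hypothetical, not from the text): the same four points converge to two different stable partitions depending on where the initial centroids are placed.

```python
def nearest(points, centroids):
    # Index of the nearest centroid for each point (squared distance).
    return [min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(pt, centroids[c])))
            for pt in points]

def kmeans_from(points, centroids, iters=10):
    # K-means started from explicit (not random) initial centroids;
    # assumes no cluster empties along the way.
    for _ in range(iters):
        labels = nearest(points, centroids)
        centroids = [tuple(sum(dim) / labels.count(c)
                           for dim in zip(*[pt for pt, a in zip(points, labels)
                                            if a == c]))
                     for c in range(len(centroids))]
    return nearest(points, centroids)

# Four corners of a wide rectangle, K = 2.
corners = [(0, 0), (0, 1), (4, 0), (4, 1)]
split_by_x = kmeans_from(corners, [(0, 0.5), (4, 0.5)])  # left vs. right
split_by_y = kmeans_from(corners, [(2, 0), (2, 1)])      # bottom vs. top
# Both runs converge, yet they partition the same data differently.
```

Each starting position is a fixed point of the update rule, so neither run can escape its initial split; a random initialization would land on one or the other unpredictably.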
 