Choosing the Clustroid: There are many ways we can define a typical point of a cluster in a non-Euclidean space. For example, we could choose the point with the smallest sum of distances to the other points, the smallest sum of the squares of those distances, or the smallest maximum distance to any other point in the cluster.
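As an illustration, all three criteria fit one small sketch in Python; the function name clustroid, the dist hook, and the criterion parameter are ours, not fixed by the text:

def clustroid(points, dist, criterion="sum"):
    # points: a list of points; dist: whatever distance applies
    # in the non-Euclidean space at hand.
    if criterion == "sum":
        score = lambda p: sum(dist(p, q) for q in points)
    elif criterion == "sumsq":
        score = lambda p: sum(dist(p, q) ** 2 for q in points)
    else:  # "max": smallest maximum distance to any other point
        score = lambda p: max(dist(p, q) for q in points)
    return min(points, key=score)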
Radius and Diameter: Whether or not the space is Euclidean, we can define the radius of a cluster to be the maximum distance from the centroid or clustroid to any point in that cluster. We can define the diameter of the cluster to be the maximum distance between any two points in the cluster. Alternative definitions, especially of the radius, are also known, for example, the average distance from the centroid to the other points.
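A minimal sketch of both definitions, assuming a cluster of at least two points and a distance function dist suited to the space:

from itertools import combinations

def radius(center, points, dist):
    # Maximum distance from the centroid/clustroid to any point.
    return max(dist(center, p) for p in points)

def diameter(points, dist):
    # Maximum distance between any pair of points in the cluster.
    return max(dist(p, q) for p, q in combinations(points, 2))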
Hierarchical Clustering: This family of algorithms has many variations, which differ primarily in two areas. First, we may choose in various ways which two clusters to merge next. Second, we may decide in various ways when to stop the merge process.
Picking Clusters to Merge: One strategy for deciding on the best pair of clusters to merge in a hierarchical clustering is to pick the clusters with the closest centroids or clustroids. Another approach is to pick the pair of clusters with the closest points, one from each cluster. A third approach is to use the average distance between points from the two clusters.
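A sketch of the latter two criteria, under the assumption that clusters are plain lists of points and dist is the space's distance function; best_merge, single_link, and average_link are illustrative names:

from itertools import combinations

def single_link(c1, c2, dist):
    # Distance between the closest pair of points, one from each cluster.
    return min(dist(p, q) for p in c1 for q in c2)

def average_link(c1, c2, dist):
    # Average distance over all cross-cluster pairs of points.
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def best_merge(clusters, linkage):
    # linkage is any two-cluster distance, e.g.
    # lambda a, b: single_link(a, b, dist).
    # Returns the indices of the pair of clusters to merge next.
    return min(combinations(range(len(clusters)), 2),
               key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))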
Stopping the Merger Process: A hierarchical clustering can proceed until there are a fixed number of clusters left. Alternatively, we could merge until it is impossible to find a pair of clusters whose merger is sufficiently compact, e.g., a merger in which the resulting cluster has a radius or diameter below some threshold. Another approach is to merge as long as the resulting cluster has a sufficiently high "density," which can be defined in various ways but is generally the number of points divided by some measure of the size of the cluster, e.g., the radius.
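For instance, a hedged sketch of one possible density test, using the number of points over the radius about the clustroid; the name dense_enough and the threshold parameter are ours:

def dense_enough(merged, dist, threshold):
    # Density here = number of points / radius of the merged cluster;
    # a merge is acceptable only while density stays above threshold.
    clustroid = min(merged, key=lambda p: sum(dist(p, q) for q in merged))
    radius = max(dist(clustroid, p) for p in merged)
    return radius > 0 and len(merged) / radius >= threshold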
K-Means Algorithms: This family of algorithms is of the point-assignment type and assumes a Euclidean space. It is assumed that there are exactly k clusters for some known k. After picking k initial cluster centroids, the points are considered one at a time and assigned to the closest centroid. The centroid of a cluster can migrate during point assignment, and an optional last step is to reassign all the points, while holding the centroids fixed at their final values obtained during the first pass.
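A sketch of one pass of this variant, assuming points and initial centroids are tuples of floats in a Euclidean space; k_means_pass is an illustrative name, not the book's:

def k_means_pass(points, centroids):
    def d2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    counts = [1] * len(centroids)   # each initial centroid counts as one point
    cents = [list(c) for c in centroids]
    for p in points:
        # Assign to the currently closest centroid, which then migrates
        # to the running mean of the points assigned to it so far.
        i = min(range(len(cents)), key=lambda j: d2(p, cents[j]))
        counts[i] += 1
        cents[i] = [c + (x - c) / counts[i] for c, x in zip(cents[i], p)]
    # Optional last step: reassign every point, centroids held fixed.
    labels = [min(range(len(cents)), key=lambda j: d2(p, cents[j]))
              for p in points]
    return cents, labels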
Initializing K-Means Algorithms: One way to find k initial centroids is to pick a random point, and then choose k − 1 additional points, each as far away as possible from the previously chosen points. An alternative is to start with a small sample of points and use a hierarchical clustering to merge them into k clusters.
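A sketch of the farthest-first initialization, with init_centroids and dist as assumed names:

import random

def init_centroids(points, k, dist):
    # Random first point, then each new point maximizes its distance
    # to the nearest already-chosen point.
    centroids = [random.choice(points)]
    while len(centroids) < k:
        centroids.append(max(points,
                key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids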
Picking K in a K-Means Algorithm: If the number of clusters is unknown, we can use a binary-search technique, trying a k-means clustering with different values of k. We search for the largest value of k for which a decrease below k clusters results in a radically higher average diameter of the clusters. This search can be carried out in a number of clustering operations that is logarithmic in the true value of k.
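One way this search might look, as a sketch: avg_diameter is a hypothetical hook that runs a clustering with v clusters and reports the average diameter, and the tolerance tol is an assumption:

def pick_k(avg_diameter, tol=0.1):
    # Phase 1: double v until going from v to 2v barely shrinks the
    # average diameter, so the curve has flattened by 2v.
    v = 1
    while avg_diameter(v) > (1 + tol) * avg_diameter(2 * v):
        v *= 2
    plateau = avg_diameter(2 * v)
    # Phase 2: binary-search [v//2, v] for the smallest k whose average
    # diameter is already close to the plateau value.
    lo, hi = max(1, v // 2), v
    while lo < hi:
        mid = (lo + hi) // 2
        if avg_diameter(mid) <= (1 + tol) * plateau:
            hi = mid
        else:
            lo = mid + 1
    return lo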
The BFR Algorithm: This algorithm is a version of k-means designed to handle data that is too large to fit in main memory. It assumes clusters are normally distributed about the axes.
Representing Clusters in BFR: Points are read from disk one chunk at a time. Clusters are represented in main memory by the count of the number of points, the vector sum of all the points, and the vector formed by summing the squares of the components of the points in each dimension. Other collections of points, too far from a cluster centroid to be included in a cluster, are represented as "miniclusters" in the same way as the k clusters, while still other points, which are not near any other point, are represented as themselves and called "retained" points.
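A sketch of this three-number summary; the class name ClusterSummary is ours:

class ClusterSummary:
    # BFR keeps only N, SUM, and SUMSQ per cluster (and per minicluster);
    # the centroid and per-dimension variance fall out of these three.
    def __init__(self, dim):
        self.n = 0
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Variance in dimension i is SUMSQ_i / N - (SUM_i / N)**2.
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]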
Processing Points in BFR: Most of the points in a main-memory load will be assigned to a nearby cluster, and the parameters for that cluster will be adjusted to account for the new points. Unassigned points can be formed into new miniclusters, and these miniclusters can be merged with previously discovered miniclusters or retained points. After the last memory load, the miniclusters and retained points can be merged with their nearest cluster or kept as outliers.
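This summary does not say how "nearby" is judged; the full algorithm typically uses a Mahalanobis-style normalized distance, so the following sketch assumes that, reusing the ClusterSummary sketch above (place_point and threshold are illustrative names):

import math

def mahalanobis(point, summary):
    # Distance to the cluster centroid, normalized in each dimension by
    # the standard deviation derived from (N, SUM, SUMSQ).
    d2 = 0.0
    for x, mean, var in zip(point, summary.centroid(), summary.variance()):
        d2 += (x - mean) ** 2 / var if var > 0 else float("inf")
    return math.sqrt(d2)

def place_point(point, summaries, threshold):
    # Add the point to the closest cluster if it is close enough;
    # otherwise return None so it can seed a minicluster or be retained.
    best = min(summaries, key=lambda s: mahalanobis(point, s))
    if mahalanobis(point, best) <= threshold:
        best.add(point)
        return best
    return None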
The CURE Algorithm: This algorithm is of the point-assignment type. It is designed for a Euclidean space, but clusters can have any shape. It handles data that is too large to fit in main memory.
Representing Clusters in CURE: The algorithm begins by clustering a small sample of points. It then selects representative points for each cluster, by picking points in the cluster that are as far away from each other as possible. The goal is to find representative points on the fringes of the cluster. However, the representative points are then moved a fraction of the way toward the centroid of the cluster, so they lie somewhat in the interior of the cluster.
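A sketch of this selection-and-shrink step, assuming points are tuples of floats; the function name and the 20% default for fraction are only illustrative:

import math

def pick_representatives(cluster, centroid, m, fraction=0.2):
    dist = math.dist  # CURE assumes a Euclidean space
    # Farthest-first selection of m scattered points.
    reps = [max(cluster, key=lambda p: dist(p, centroid))]
    while len(reps) < m:
        reps.append(max(cluster,
                key=lambda p: min(dist(p, r) for r in reps)))
    # Move each representative a fraction of the way toward the centroid.
    return [tuple(ri + fraction * (ci - ri) for ri, ci in zip(r, centroid))
            for r in reps]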