Choosing the Clustroid: There are many ways we can define a typical point of a cluster in a non-Euclidean space. For example, we could choose the point with the smallest sum of distances to the other points, the smallest sum of the squares of those distances, or the smallest maximum distance to any other point in the cluster.
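As an illustration, all three criteria fit one small sketch in Python; the function name clustroid, the dist hook, and the criterion parameter are ours, not fixed by the text:

def clustroid(points, dist, criterion="sum"):
    # points: a list of points; dist: whatever distance applies
    # in the non-Euclidean space at hand.
    if criterion == "sum":
        score = lambda p: sum(dist(p, q) for q in points)
    elif criterion == "sumsq":
        score = lambda p: sum(dist(p, q) ** 2 for q in points)
    else:  # "max": smallest maximum distance to any other point
        score = lambda p: max(dist(p, q) for q in points)
    return min(points, key=score)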
Radius and Diameter: Whether or not the space is Euclidean, we can define the radius of a cluster to be the maximum distance from the centroid or clustroid to any point in that cluster. We can define the diameter of the cluster to be the maximum distance between any two points in the cluster. Alternative definitions, especially of the radius, are also known, for example, the average distance from the centroid to the other points.
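A minimal sketch of both definitions, assuming a cluster of at least two points and a distance function dist suited to the space:

from itertools import combinations

def radius(center, points, dist):
    # Maximum distance from the centroid/clustroid to any point.
    return max(dist(center, p) for p in points)

def diameter(points, dist):
    # Maximum distance between any pair of points in the cluster.
    return max(dist(p, q) for p, q in combinations(points, 2))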
Hierarchical Clustering: This family of algorithms has many variations, which differ primarily in two areas. First, we may choose in various ways which two clusters to merge next. Second, we may decide in various ways when to stop the merge process.
Picking Clusters to Merge: One strategy for deciding on the best pair of clusters to merge in a hierarchical clustering is to pick the clusters with the closest centroids or clustroids. Another approach is to pick the pair of clusters with the closest points, one from each cluster. A third approach is to use the average distance between points from the two clusters.
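A sketch of the latter two criteria, under the assumption that clusters are plain lists of points and dist is the space's distance function; best_merge, single_link, and average_link are illustrative names:

from itertools import combinations

def single_link(c1, c2, dist):
    # Distance between the closest pair of points, one from each cluster.
    return min(dist(p, q) for p in c1 for q in c2)

def average_link(c1, c2, dist):
    # Average distance over all cross-cluster pairs of points.
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

def best_merge(clusters, linkage):
    # linkage is any two-cluster distance, e.g.
    # lambda a, b: single_link(a, b, dist).
    # Returns the indices of the pair of clusters to merge next.
    return min(combinations(range(len(clusters)), 2),
               key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))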
Stopping the Merger Process: A hierarchical clustering can proceed until there are a fixed number of clusters left. Alternatively, we could merge until it is impossible to find a pair of clusters whose merger is sufficiently compact, e.g., a merger in which the resulting cluster has a radius or diameter below some threshold. Another approach is to merge as long as the resulting cluster has a sufficiently high "density," which can be defined in various ways but is generally the number of points divided by some measure of the size of the cluster, e.g., the radius.
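For instance, a hedged sketch of one possible density test, using the number of points over the radius about the clustroid; the name dense_enough and the threshold parameter are ours:

def dense_enough(merged, dist, threshold):
    # Density here = number of points / radius of the merged cluster;
    # a merge is acceptable only while density stays above threshold.
    clustroid = min(merged, key=lambda p: sum(dist(p, q) for q in merged))
    radius = max(dist(clustroid, p) for p in merged)
    return radius > 0 and len(merged) / radius >= threshold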
K-Means Algorithms: This family of algorithms is of the point-assignment type and assumes a Euclidean space. It is assumed that there are exactly k clusters for some known k. After picking k initial cluster centroids, the points are considered one at a time and assigned to the closest centroid. The centroid of a cluster can migrate during point assignment, and an optional last step is to reassign all the points, while holding the centroids fixed at their final values obtained during the first pass.
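A sketch of one pass of this variant, assuming points and initial centroids are tuples of floats in a Euclidean space; k_means_pass is an illustrative name, not the book's:

def k_means_pass(points, centroids):
    def d2(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))
    counts = [1] * len(centroids)   # each initial centroid counts as one point
    cents = [list(c) for c in centroids]
    for p in points:
        # Assign to the currently closest centroid, which then migrates
        # to the running mean of the points assigned to it so far.
        i = min(range(len(cents)), key=lambda j: d2(p, cents[j]))
        counts[i] += 1
        cents[i] = [c + (x - c) / counts[i] for c, x in zip(cents[i], p)]
    # Optional last step: reassign every point, centroids held fixed.
    labels = [min(range(len(cents)), key=lambda j: d2(p, cents[j]))
              for p in points]
    return cents, labels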
Initializing K-Means Algorithms: One way to find k initial centroids is to pick a random point, and then choose k − 1 additional points, each as far away as possible from the previously chosen points. An alternative is to start with a small sample of points and use a hierarchical clustering to merge them into k clusters.
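A sketch of the farthest-first initialization, with init_centroids and dist as assumed names:

import random

def init_centroids(points, k, dist):
    # Random first point, then each new point maximizes its distance
    # to the nearest already-chosen point.
    centroids = [random.choice(points)]
    while len(centroids) < k:
        centroids.append(max(points,
                key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids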
Picking K in a K-Means Algorithm: If the number of clusters is unknown, we can use a binary-search technique, trying a k-means clustering with different values of k. We search for the largest value of k for which a decrease below k clusters results in a radically higher average diameter of the clusters. This search can be carried out in a number of clustering operations that is logarithmic in the true value of k.
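One way this search might look, as a sketch: avg_diameter is a hypothetical hook that runs a clustering with v clusters and reports the average diameter, and the tolerance tol is an assumption:

def pick_k(avg_diameter, tol=0.1):
    # Phase 1: double v until going from v to 2v barely shrinks the
    # average diameter, so the curve has flattened by 2v.
    v = 1
    while avg_diameter(v) > (1 + tol) * avg_diameter(2 * v):
        v *= 2
    plateau = avg_diameter(2 * v)
    # Phase 2: binary-search [v//2, v] for the smallest k whose average
    # diameter is already close to the plateau value.
    lo, hi = max(1, v // 2), v
    while lo < hi:
        mid = (lo + hi) // 2
        if avg_diameter(mid) <= (1 + tol) * plateau:
            hi = mid
        else:
            lo = mid + 1
    return lo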
The BFR Algorithm: This algorithm is a version of k-means designed to handle data that is too large to fit in main memory. It assumes clusters are normally distributed about the axes.
Representing Clusters in BFR: Points are read from disk one chunk at a time. Clusters are represented in main memory by the count of the number of points, the vector sum of all the points, and the vector formed by summing the squares of the components of the points in each dimension. Other collections of points, too far from a cluster centroid to be included in a cluster, are represented as "miniclusters" in the same way as the k clusters, while still other points, which are not near any other point, are represented as themselves and called "retained" points.
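A sketch of this three-number summary; the class name ClusterSummary is ours:

class ClusterSummary:
    # BFR keeps only N, SUM, and SUMSQ per cluster (and per minicluster);
    # the centroid and per-dimension variance fall out of these three.
    def __init__(self, dim):
        self.n = 0
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        # Variance in dimension i is SUMSQ_i / N - (SUM_i / N)**2.
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]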
Processing Points in BFR: Most of the points in a main-memory load will be assigned to a nearby cluster, and the parameters for that cluster will be adjusted to account for the new points. Unassigned points can be formed into new miniclusters, and these miniclusters can be merged with previously discovered miniclusters or retained points. After the last memory load, the miniclusters and retained points can be merged with their nearest cluster or kept as outliers.
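This summary does not say how "nearby" is judged; the full algorithm typically uses a Mahalanobis-style normalized distance, so the following sketch assumes that, reusing the ClusterSummary sketch above (place_point and threshold are illustrative names):

import math

def mahalanobis(point, summary):
    # Distance to the cluster centroid, normalized in each dimension by
    # the standard deviation derived from (N, SUM, SUMSQ).
    d2 = 0.0
    for x, mean, var in zip(point, summary.centroid(), summary.variance()):
        d2 += (x - mean) ** 2 / var if var > 0 else float("inf")
    return math.sqrt(d2)

def place_point(point, summaries, threshold):
    # Add the point to the closest cluster if it is close enough;
    # otherwise return None so it can seed a minicluster or be retained.
    best = min(summaries, key=lambda s: mahalanobis(point, s))
    if mahalanobis(point, best) <= threshold:
        best.add(point)
        return best
    return None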
The CURE Algorithm: This algorithm is of the point-assignment type. It is designed for a Euclidean space, but clusters can have any shape. It handles data that is too large to fit in main memory.
Representing Clusters in CURE: The algorithm begins by clustering a small sample of points. It then selects representative points for each cluster, by picking points in the cluster that are as far away from each other as possible. The goal is to find representative points on the fringes of the cluster. However, the representative points are then moved a fraction of the way toward the centroid of the cluster, so they lie somewhat in the interior of the cluster.
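A sketch of this selection-and-shrink step, assuming points are tuples of floats; the function name and the 20% default for fraction are only illustrative:

import math

def pick_representatives(cluster, centroid, m, fraction=0.2):
    dist = math.dist  # CURE assumes a Euclidean space
    # Farthest-first selection of m scattered points.
    reps = [max(cluster, key=lambda p: dist(p, centroid))]
    while len(reps) < m:
        reps.append(max(cluster,
                key=lambda p: min(dist(p, r) for r in reps)))
    # Move each representative a fraction of the way toward the centroid.
    return [tuple(ri + fraction * (ci - ri) for ri, ci in zip(r, centroid))
            for r in reps]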