Clustering - Mining of Massive Datasets

Databases Reference

In-Depth Information

F

The BFR Algorithm: This algorithm is a version of k-means designed to

handle data that is too large to fit in main memory. It assumes clusters

are normally distributed about the axes.

F

Representing Clusters in BFR: Points are read from disk one chunk at a

time. Clusters are represented in main memory by the count of the num-

ber of points, the vector sum of all the points, and the vector formed by

summing the squares of the components of the points in each dimension.

Other collection of points, too far from a cluster centroid to be included

in a cluster, are represented as “miniclusters” in the same way as the k

clusters, while still other points, which are not near any other point will

be represented as themselves and called “retained” points.

F

Processing Points in BFR: Most of the points in a main-memory load

will be assigned to a nearby cluster and the parameters for that cluster

will be adjusted to account for the new points. Unassigned points can

be formed into new miniclusters, and these miniclusters can be merged

with previously discovered miniclusters or retained points. After the last

memory load, the miniclusters and retained points can be merged to their

nearest cluster or kept as outliers.

F

The CURE Algorithm: This algorithm is of the point-assignment type.

It is designed for a Euclidean space, but clusters can have any shape. It

handles data that is too large to fit in main memory.

F

Representing Clusters in CURE : The algorithm begins by clustering a

small sample of points. It then selects representative points for each

cluster, by picking points in the cluster that are as far away from each

other as possible. The goal is to find representative points on the fringes of

the cluster. However, the representative points are then moved a fraction

of the way toward the centroid of the cluster, so they lie somewhat in the

interior of the cluster.

F

Processing Points in CURE : After creating representative points for each

cluster, the entire set of points can be read from disk and assigned to a

cluster. We assign a given point to the cluster of the representative point

that is closest to the given point.

F

The GRGPF Algorithm: This algorithm is of the point-assignment type.

It handles data that is too big to fit in main memory, and it does not

assume a Euclidean space.

F

Representing Clusters in GRGPF : A cluster is represented by the count

of points in the cluster, the clustroid, a set of points nearest the clustroid

and a set of points furthest from the clustroid. The nearby points allow

us to change the clustroid if the cluster evolves, and the distant points

allow for merging clusters e ciently in appropriate circumstances. For

each of these points, we also record the rowsum, that is the square root

Mining of Massive Datasets

Search WWH ::

Custom Search

Home