Databases Reference
In-Depth Information
F
The BFR Algorithm: This algorithm is a version of k-means designed to
handle data that is too large to fit in main memory. It assumes clusters
are normally distributed about the axes.
F
Representing Clusters in BFR: Points are read from disk one chunk at a
time. Clusters are represented in main memory by the count of the num-
ber of points, the vector sum of all the points, and the vector formed by
summing the squares of the components of the points in each dimension.
Other collection of points, too far from a cluster centroid to be included
in a cluster, are represented as “miniclusters” in the same way as the k
clusters, while still other points, which are not near any other point will
be represented as themselves and called “retained” points.
F
Processing Points in BFR: Most of the points in a main-memory load
will be assigned to a nearby cluster and the parameters for that cluster
will be adjusted to account for the new points. Unassigned points can
be formed into new miniclusters, and these miniclusters can be merged
with previously discovered miniclusters or retained points. After the last
memory load, the miniclusters and retained points can be merged to their
nearest cluster or kept as outliers.
F
The CURE Algorithm: This algorithm is of the point-assignment type.
It is designed for a Euclidean space, but clusters can have any shape. It
handles data that is too large to fit in main memory.
F
Representing Clusters in CURE : The algorithm begins by clustering a
small sample of points. It then selects representative points for each
cluster, by picking points in the cluster that are as far away from each
other as possible. The goal is to find representative points on the fringes of
the cluster. However, the representative points are then moved a fraction
of the way toward the centroid of the cluster, so they lie somewhat in the
interior of the cluster.
F
Processing Points in CURE : After creating representative points for each
cluster, the entire set of points can be read from disk and assigned to a
cluster. We assign a given point to the cluster of the representative point
that is closest to the given point.
F
The GRGPF Algorithm: This algorithm is of the point-assignment type.
It handles data that is too big to fit in main memory, and it does not
assume a Euclidean space.
F
Representing Clusters in GRGPF : A cluster is represented by the count
of points in the cluster, the clustroid, a set of points nearest the clustroid
and a set of points furthest from the clustroid. The nearby points allow
us to change the clustroid if the cluster evolves, and the distant points
allow for merging clusters e ciently in appropriate circumstances. For
each of these points, we also record the rowsum, that is the square root
Search WWH ::




Custom Search