Streaming K-means
MLlib also includes a streaming version of K-means clustering, called StreamingKMeans. This model is an extension of the mini-batch K-means algorithm in which the model is updated with each batch, combining the cluster centers computed from the previous batches with the cluster centers computed for the current batch.
StreamingKMeans supports a forgetfulness parameter, alpha (set using the setDecayFactor method), which controls how aggressively the model gives weight to newer data. An alpha value of 0 means the model will use only new data, while with an alpha value of 1, all data since the beginning of the streaming application will be used.
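To make the effect of alpha concrete, here is a minimal sketch of a decayed weighted-average update for a single cluster center, written in plain Python rather than against the Spark API. The function name update_center and the weight bookkeeping are assumptions made for illustration; MLlib's internal implementation may differ in its details.

```python
def update_center(center, n, batch_mean, m, alpha):
    """Blend a running cluster center with the current batch's center.

    center, batch_mean: lists of floats of equal length
    n: weight accumulated by this cluster from previous batches
    m: number of points assigned to this cluster in the current batch
    alpha: decay (forgetfulness) factor between 0 and 1
    """
    discounted = n * alpha          # how much the history still counts
    total = discounted + m
    if total == 0:
        # No history and no new points: leave the center unchanged.
        return list(center), 0.0
    new_center = [
        (c * discounted + x * m) / total
        for c, x in zip(center, batch_mean)
    ]
    return new_center, total


# alpha = 0: the history is discarded, so the center moves to the batch mean.
print(update_center([0.0, 0.0], 10, [4.0, 2.0], 10, alpha=0.0))
# alpha = 1: old and new points count equally, giving the mean of all data.
print(update_center([0.0, 0.0], 10, [4.0, 2.0], 10, alpha=1.0))
```

Values of alpha between 0 and 1 interpolate between these two extremes, with older batches contributing progressively less.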
We will not cover streaming K-means further here (the Spark documentation at ht… contains further detail and an example). However, you could try adapting the preceding streaming regression data producer to generate input data for a StreamingKMeans model. You could also adapt the streaming regression application to use StreamingKMeans.
You can create the clustering data producer by first selecting a number of clusters, K, and then generating each data point by:
• Randomly selecting a cluster index.
• Generating a random vector using specific normal distribution parameters for each cluster. That is, each of the K clusters will have a mean and variance parameter, from which the random vectors will be generated using an approach similar to our preceding generateRandomArray function.
In this way, each data point that belongs to the same cluster will be drawn from the same
distribution, so our streaming clustering model should be able to learn the correct cluster
centers over time.
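The two generation steps described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names make_cluster_params and generate_point, the parameter ranges, and the comma-separated output format are all hypothetical, and the spread of each cluster is expressed as a standard deviation (the square root of the variance) because that is what random.gauss expects.

```python
import random


def make_cluster_params(k, dim, rng):
    """Pick a mean vector and a standard deviation for each of k clusters.

    The ranges used here are arbitrary choices for illustration.
    """
    return [
        ([rng.uniform(-10.0, 10.0) for _ in range(dim)], rng.uniform(0.5, 1.5))
        for _ in range(k)
    ]


def generate_point(params, rng):
    """Select a cluster index at random, then sample a vector from that
    cluster's normal distribution."""
    idx = rng.randrange(len(params))
    mean, std = params[idx]
    return idx, [rng.gauss(mu, std) for mu in mean]


rng = random.Random(42)  # fixed seed so that runs are repeatable
params = make_cluster_params(k=3, dim=2, rng=rng)
for _ in range(5):
    idx, point = generate_point(params, rng)
    # Emit one comma-separated line per point, as a text-based producer might.
    print(",".join(f"{v:.4f}" for v in point))
```

Because the cluster parameters are fixed once up front, every point drawn for a given cluster index comes from the same distribution, which is the property the streaming model relies on to recover the centers.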