Streaming K-means
MLlib also includes a streaming version of K-means clustering, called StreamingKMeans.
This model is an extension of the mini-batch K-means algorithm: the model is updated with
each batch using a combination of the cluster centers computed from the previous batches
and the cluster centers computed for the current batch.
StreamingKMeans supports a forgetfulness parameter, alpha (set using the setDecayFactor
method), which controls how aggressively the model weights newer data. An alpha value of
0 means the model will use only the most recent batch, while with an alpha value of 1, all
data since the beginning of the streaming application will be used.
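To make the role of alpha concrete, the following minimal sketch applies a decayed weighted average to a single cluster center, following the form of the update rule given in the Spark clustering guide. The names DecayedCenterUpdate and updateCenter are hypothetical and are not part of MLlib, which performs this update internally; treat this as an illustration rather than MLlib's actual implementation.

```scala
object DecayedCenterUpdate {
  // Hypothetical helper (not an MLlib API): blends an existing cluster center
  // with the center computed from the current batch.
  //   oldCenter   - center learned from previous batches
  //   oldWeight   - number of points previously assigned to this cluster
  //   batchCenter - mean of the points assigned to this cluster in the new batch
  //   batchCount  - number of points assigned to this cluster in the new batch
  //   alpha       - decay factor in [0, 1]
  def updateCenter(
      oldCenter: Array[Double],
      oldWeight: Double,
      batchCenter: Array[Double],
      batchCount: Double,
      alpha: Double): Array[Double] = {
    require(oldCenter.length == batchCenter.length, "dimension mismatch")
    require(alpha >= 0.0 && alpha <= 1.0, "alpha must lie in [0, 1]")

    // Past data is discounted by alpha before being combined with the new batch.
    val discountedWeight = oldWeight * alpha
    val totalWeight = discountedWeight + batchCount

    if (totalWeight == 0.0) {
      // No usable information from either side: keep the old center.
      oldCenter.clone()
    } else {
      oldCenter.zip(batchCenter).map { case (c, x) =>
        (c * discountedWeight + x * batchCount) / totalWeight
      }
    }
  }
}
```

With alpha set to 0, the discounted weight of the past is zero and the result is simply the batch center. With alpha set to 1, the old center and the batch center are averaged in proportion to the number of points behind each.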
We will not cover streaming K-means further here (the Spark documentation at
http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering contains
further detail and an example). However, you could try adapting the preceding streaming
regression data producer to generate input data for a StreamingKMeans model. You
could also adapt the streaming regression application to use StreamingKMeans.
You can create the clustering data producer by first selecting a number of clusters, K, and
then generating each data point by:
• Randomly selecting a cluster index.
• Generating a random vector using the normal distribution parameters of the selected
cluster. That is, each of the K clusters will have its own mean and variance parameters,
from which the random vectors will be generated using an approach similar to our
preceding generateRandomArray function.
In this way, each data point that belongs to the same cluster will be drawn from the same
distribution, so our streaming clustering model should be able to learn the correct cluster
centers over time.
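The two generation steps above can be sketched in plain Scala as follows. This is only an illustration under stated assumptions: ClusterDataGenerator, ClusterParams, generatePoint, and generateBatch are hypothetical names, each cluster is assumed to share a single variance across all dimensions, and the socket-writing part of a real producer is left out.

```scala
import scala.util.Random

object ClusterDataGenerator {
  // Hypothetical per-cluster parameters: a mean vector and one variance
  // shared by every dimension (an assumption made to keep the sketch short).
  final case class ClusterParams(mean: Array[Double], variance: Double)

  // Generates one data point: picks a cluster index uniformly at random, then
  // draws each coordinate from a normal distribution around that cluster's mean.
  // The index is returned only so the output can be checked; a real producer
  // would send just the vector.
  def generatePoint(
      clusters: IndexedSeq[ClusterParams],
      rng: Random): (Int, Array[Double]) = {
    require(clusters.nonEmpty, "at least one cluster is required")
    val index = rng.nextInt(clusters.length)
    val params = clusters(index)
    val stdDev = math.sqrt(params.variance)
    val point = params.mean.map(m => m + stdDev * rng.nextGaussian())
    (index, point)
  }

  // Generates a batch of n points, as a producer might do on each tick.
  def generateBatch(
      clusters: IndexedSeq[ClusterParams],
      n: Int,
      rng: Random): Seq[(Int, Array[Double])] =
    Seq.fill(n)(generatePoint(clusters, rng))
}
```

Each generated vector could then be formatted as text and written to the socket in the same way as in the streaming regression producer.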