Real-time Machine Learning with Spark Streaming - Machine Learning with Spark

Streaming K-means

MLlib also includes a streaming version of K-means clustering, called StreamingKMeans. This model is an extension of the mini-batch K-means algorithm in which the model is updated with each batch, based on a combination of the cluster centers computed from the previous batches and the cluster centers computed for the current batch.

StreamingKMeans supports a forgetfulness parameter, alpha (set using the setDecayFactor method), which controls how aggressively the model gives weight to newer data. An alpha value of 0 means the model will use only new data, while with an alpha value of 1, all data since the beginning of the streaming application will be used.
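
To make the effect of alpha concrete, here is a minimal Python sketch of a decayed weighted-average update for a single cluster center. It is not MLlib's implementation: the function name update_center, its arguments, and the way the accumulated weight is discounted are assumptions chosen for illustration. It is only meant to reproduce the two limiting cases described above.

```python
def update_center(center, weight, batch_mean, batch_count, alpha):
    """Blend an existing cluster center with the current batch (illustrative only).

    center      -- current center coordinates (list of floats)
    weight      -- number of points that have contributed to the center so far
    batch_mean  -- mean of the points assigned to this cluster in the new batch
    batch_count -- number of points assigned to this cluster in the new batch
    alpha       -- decay factor in [0, 1]; 0 forgets history, 1 keeps all of it
    """
    discounted = weight * alpha          # how much the history still counts
    total = discounted + batch_count
    if total == 0:
        # No history kept and no new points: leave the center unchanged.
        return list(center), 0.0
    new_center = [
        (c * discounted + x * batch_count) / total
        for c, x in zip(center, batch_mean)
    ]
    return new_center, total


# A center at the origin built from 100 points, then a batch of 10 points
# whose mean is (1, 1).
only_new, _ = update_center([0.0, 0.0], 100.0, [1.0, 1.0], 10, alpha=0.0)
all_data, _ = update_center([0.0, 0.0], 100.0, [1.0, 1.0], 10, alpha=1.0)
```

With alpha set to 0, the center jumps to the mean of the current batch. With alpha set to 1, the 100 earlier points and the 10 new points are weighted equally, so the center moves only slightly.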

We will not cover streaming K-means further here (the Spark documentation contains further detail and an example). However, you could try to adapt the preceding streaming regression data producer to generate input data for a StreamingKMeans model. You could also adapt the streaming regression application to use StreamingKMeans.

You can create the clustering data producer by first selecting a number of clusters, K, and then generating each data point by:

• Randomly selecting a cluster index.

• Generating a random vector using specific normal distribution parameters for each cluster. That is, each of the K clusters will have a mean and variance parameter, from which the random vectors will be generated using an approach similar to our preceding generateRandomArray function.

In this way, each data point that belongs to the same cluster will be drawn from the same distribution, so our streaming clustering model should be able to learn the correct cluster centers over time.
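
The producer steps above could be sketched in plain Python as follows. The names make_clusters and generate_point, the ranges used for the means and spreads, and the use of one standard deviation per cluster (rather than a variance) are all illustrative assumptions. This is not the generateRandomArray function referred to above, and sending the points to a stream is left out.

```python
import random


def make_clusters(k, dim, rng):
    """Pick a mean vector and a standard deviation for each of the k clusters.

    The value ranges below are arbitrary choices made for this sketch.
    """
    return [
        ([rng.uniform(-10.0, 10.0) for _ in range(dim)], rng.uniform(0.5, 2.0))
        for _ in range(k)
    ]


def generate_point(clusters, rng):
    """Generate one data point: choose a cluster, then sample around its mean."""
    idx = rng.randrange(len(clusters))           # randomly select a cluster index
    mean, std = clusters[idx]
    point = [rng.gauss(m, std) for m in mean]    # normal draw per dimension
    return idx, point


rng = random.Random(42)                          # seeded so runs are repeatable
clusters = make_clusters(k=3, dim=2, rng=rng)
batch = [generate_point(clusters, rng) for _ in range(100)]
```

Each element of batch pairs the true cluster index with the generated vector. Keeping the index is optional, but it makes it easy to check later whether a clustering model has recovered centers close to the means chosen here.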
