Vectors.dense(features))
}
The final step is to tell the model to train and test on our transformed DStream and also to
print out the first few elements of each batch in the DStream of predicted values:
// train and test model on the stream, and print predictions
// for illustrative purposes
model.trainOn(labeledStream)
model.predictOn(labeledStream).print()
ssc.start()
ssc.awaitTermination()
}
}
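For context, a minimal end-to-end sketch of the streaming regression program described above might look like the following. The socket host and port, batch interval, and feature dimension are illustrative assumptions, and note that in newer MLlib versions predictOn expects a DStream of Vector rather than LabeledPoint:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

object SimpleStreamingModelSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("SimpleStreamingModelSketch")
    // batch interval of 10 seconds (an assumed value)
    val ssc = new StreamingContext(conf, Seconds(10))

    // parse each incoming record of the form "label,f1,f2,..." into a LabeledPoint
    val labeledStream = ssc.socketTextStream("localhost", 9999).map { event =>
      val split = event.split(",")
      val label = split.head.toDouble
      val features = split.tail.map(_.toDouble)
      LabeledPoint(label, Vectors.dense(features))
    }

    // streaming model with zero initial weights; the dimension must
    // match the feature vectors produced by the stream (assumed 100 here)
    val numFeatures = 100
    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.dense(Array.fill(numFeatures)(0.0)))

    // train on each batch, and print a few predictions per batch;
    // newer MLlib versions expect a DStream[Vector] for predictOn,
    // so we map each LabeledPoint to its feature vector first
    model.trainOn(labeledStream)
    model.predictOn(labeledStream.map(_.features)).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```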
Tip
Note that because we are using the same MLlib model classes for streaming as we did for
batch processing, we can, if we choose, perform multiple iterations over the training data
in each batch (which is just an RDD of LabeledPoint instances).
Here, we will set the number of iterations to 1 to simulate purely online learning. In practice, you can set the number of iterations higher, but note that the training time per batch
will go up. If the training time per batch is much higher than the batch interval, the
streaming model will start to lag behind the velocity of the data stream.
This can be handled by decreasing the number of iterations, increasing the batch interval,
or increasing the parallelism of our streaming program by adding more Spark workers.
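As a sketch of this tuning (using the setter methods on MLlib's StreamingLinearRegressionWithSGD; the specific values are illustrative, not recommendations):

```scala
// More iterations per batch can improve the fit on each batch but
// increase per-batch training time; keep training time below the
// batch interval so the model does not fall behind the stream.
val tunedModel = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(Array.fill(100)(0.0)))
  .setNumIterations(5)  // e.g. 5 passes over each batch instead of 1
  .setStepSize(0.01)    // smaller steps are often safer with more iterations
```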
Now, we're ready to run SimpleStreamingModel in our second terminal window using
sbt run in the same way as we did for the producer (remember to select the correct
main method for SBT to execute). Once the streaming program starts running, you should
see the following output in the producer console:
Got client connected from: /127.0.0.1
...
Created 10 events...