Real-time Machine Learning with Spark Streaming - Machine Learning with Spark


A simple streaming regression program

To illustrate the use of streaming regression, we will create a simple example similar to the preceding one, which uses simulated data. We will write a producer program that generates random feature vectors and target variables, given a fixed, known weight vector, and writes each training example to a network stream.

Our consumer application will run a streaming regression model, training and then testing on our simulated data stream. Our first example consumer will simply print its predictions to the console.

Creating a streaming data producer

The data producer operates in a manner similar to our product event producer example. Recall from Chapter 5, Building a Classification Model with Spark, that a linear model is a linear combination (or vector dot product) of a weight vector, w, and a feature vector, x (that is, w^T x). Our producer will generate synthetic data using a fixed, known weight vector and randomly generated feature vectors. This data fits the linear model formulation exactly, so we expect our regression model to learn the true weight vector fairly easily.
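The data-generating idea described above can be sketched in a few lines. This is only an illustrative sketch, not the producer developed in this section: it uses plain Scala arrays rather than Breeze so that it runs without extra dependencies, and the names LinearDataSketch, dot, and generate, the fixed seed, and the tab-and-comma output layout are all assumptions made for the example.

```scala
import scala.util.Random

// Hypothetical helper object, for illustration only.
object LinearDataSketch {

  // Dot product of two equal-length vectors, that is, w^T x.
  def dot(w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length, "vectors must have the same length")
    w.zip(x).map { case (wi, xi) => wi * xi }.sum
  }

  // Generate n (target, features) pairs from a fixed weight vector.
  // Each target is exactly w^T x, with no added noise.
  def generate(w: Array[Double], n: Int, rng: Random): Seq[(Double, Array[Double])] =
    Seq.fill(n) {
      val x = Array.fill(w.length)(rng.nextGaussian())
      (dot(w, x), x)
    }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)   // fixed seed (assumed) so runs are repeatable
    val numFeatures = 5        // kept small here for readability
    val w = Array.fill(numFeatures)(rng.nextGaussian())

    // Print each example as "target<TAB>comma-separated features" (assumed layout).
    generate(w, 3, rng).foreach { case (y, x) =>
      val features = x.map(v => f"$v%.4f").mkString(",")
      println(f"$y%.4f" + "\t" + features)
    }
  }
}
```

Because every target is computed directly from the weight vector with no noise term, a linear regression model trained on such data has an exact solution to recover, which is why the true weights should be learned fairly easily.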

First, we will set up a maximum number of events per second (say, 100) and the number of features in our feature vector (also 100 in this example):

/**
 * A producer application that generates random linear regression data.
 */
object StreamingModelProducer {
  import breeze.linalg._
  import scala.util.Random

  def main(args: Array[String]): Unit = {
    // Maximum number of events per second
    val MaxEvents = 100
    // Number of features in each feature vector
    val NumFeatures = 100
    val random = new Random()
