Tuning model parameters
So far in this chapter, we have illustrated the concepts of model training and evaluation for
MLlib's regression models by training and testing on the same dataset. We will now use the
same cross-validation approach that we used previously to evaluate the effect of different
parameter settings on the performance of our models.
Creating training and testing sets to evaluate parameters
The first step is to create a test and training set for cross-validation purposes. Spark's
Python API does not yet provide the randomSplit convenience method that is available
in Scala. Hence, we will need to create a training and test dataset manually.
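As an aside, later Spark releases did add randomSplit to the Python API; on a more recent version of PySpark, the manual split that follows could be replaced by a single call along these lines (an illustrative alternative, not part of the workflow below):

train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)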
One relatively easy way to do this is by first taking a random sample of, say, 20 percent of
our data as our test set. We will then define our training set as the elements of the original
RDD that are not in the test set RDD.
We can achieve this using the sample method to take a random sample for our test set,
followed by the subtractByKey method, which returns the elements of one RDD whose
keys do not overlap with those of the other RDD.
Note that subtractByKey, as the name suggests, works on the keys of RDD elements
that consist of key-value pairs. Since our extracted training examples are plain
LabeledPoint instances rather than key-value pairs, we will first apply
zipWithIndex to our RDD. This creates an RDD of (LabeledPoint, index) pairs.
We will then reverse the keys and values so that we can operate on the index keys:
# Pair each LabeledPoint with its index, then swap so the index becomes the key
data_with_idx = data.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
# Sample 20 percent of the records (without replacement, with a fixed seed of 42)
test = data_with_idx.sample(False, 0.2, 42)
# The training set is every record whose index key is not in the test set
train = data_with_idx.subtractByKey(test)
Once we have the two RDDs, we will recover just the LabeledPoint instances we need
for training and test data, using map to extract the value from the key-value pairs:
# Drop the index keys, keeping only the LabeledPoint values
train_data = train.map(lambda idx_p: idx_p[1])
test_data = test.map(lambda idx_p: idx_p[1])
train_size = train_data.count()
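With the split in place, it is worth sanity-checking that the two sets partition the data, and sketching the kind of parameter-evaluation loop this setup enables. The following is a minimal sketch, not the book's exact code: the evaluate helper is a hypothetical name, and using LinearRegressionWithSGD with mean squared error as the metric is an assumption for illustration:

# Sanity check: the test and training sets should partition the original data
test_size = test_data.count()
print("Training set size: %d" % train_size)
print("Test set size: %d" % test_size)
print("Total: %d" % (train_size + test_size))

from pyspark.mllib.regression import LinearRegressionWithSGD

# Hypothetical helper: train a model with a given step size and return
# the mean squared error over the held-out test set
def evaluate(train, test, iterations, step):
    model = LinearRegressionWithSGD.train(train, iterations=iterations, step=step)
    preds_and_labels = test.map(lambda p: (model.predict(p.features), p.label))
    return preds_and_labels.map(lambda pl: (pl[0] - pl[1]) ** 2).mean()

# Evaluate a few candidate step sizes on the held-out test set
for step in [0.01, 0.025, 0.05, 0.1]:
    print("step %.3f -> test MSE %.4f" % (step, evaluate(train_data, test_data, 10, step)))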