Tuning model parameters
So far in this chapter, we have illustrated the concepts of model training and evaluation for
MLlib's regression models by training and testing on the same dataset. We will now use the
same cross-validation approach that we used previously to evaluate the effect of different
parameter settings on the performance of our models.
Creating training and testing sets to evaluate parameters
The first step is to create a test and training set for cross-validation purposes. Spark's
Python API does not yet provide the randomSplit convenience method that is available
in Scala. Hence, we will need to create a training and test dataset manually.
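As an aside, later Spark releases did add randomSplit to the Python API; on a more recent version of PySpark, the manual split that follows could be replaced by a single call along these lines (an illustrative alternative, not part of the workflow below):

train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)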
One relatively easy way to do this is by first taking a random sample of, say, 20 percent of
our data as our test set. We will then define our training set as the elements of the original
RDD that are not in the test set RDD.
We can achieve this using the sample method to take a random sample for our test set,
followed by the subtractByKey method, which returns the elements of one RDD whose
keys do not overlap with those of the other RDD.
Note that subtractByKey, as the name suggests, works on the keys of RDD elements
that consist of key-value pairs. Since our extracted training examples are plain
LabeledPoint instances rather than key-value pairs, we will first apply
zipWithIndex to our RDD. This creates an RDD of (LabeledPoint, index) pairs.
We will then reverse the keys and values so that we can operate on the index keys:
# Pair each LabeledPoint with its index, then swap so the index becomes the key
data_with_idx = data.zipWithIndex().map(lambda kv: (kv[1], kv[0]))
# Sample 20 percent of the records (without replacement, with a fixed seed of 42)
test = data_with_idx.sample(False, 0.2, 42)
# The training set is every record whose index key is not in the test set
train = data_with_idx.subtractByKey(test)
Once we have the two RDDs, we will recover just the LabeledPoint instances we need
for training and test data, using map to extract the value from the key-value pairs:
# Drop the index keys, keeping only the LabeledPoint values
train_data = train.map(lambda idx_p: idx_p[1])
test_data = test.map(lambda idx_p: idx_p[1])
train_size = train_data.count()
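With the split in place, it is worth sanity-checking that the two sets partition the data, and sketching the kind of parameter-evaluation loop this setup enables. The following is a minimal sketch, not the book's exact code: the evaluate helper is a hypothetical name, and using LinearRegressionWithSGD with mean squared error as the metric is an assumption for illustration:

# Sanity check: the test and training sets should partition the original data
test_size = test_data.count()
print("Training set size: %d" % train_size)
print("Test set size: %d" % test_size)
print("Total: %d" % (train_size + test_size))

from pyspark.mllib.regression import LinearRegressionWithSGD

# Hypothetical helper: train a model with a given step size and return
# the mean squared error over the held-out test set
def evaluate(train, test, iterations, step):
    model = LinearRegressionWithSGD.train(train, iterations=iterations, step=step)
    preds_and_labels = test.map(lambda p: (model.predict(p.features), p.label))
    return preds_and_labels.map(lambda pl: (pl[0] - pl[1]) ** 2).mean()

# Evaluate a few candidate step sizes on the held-out test set
for step in [0.01, 0.025, 0.05, 0.1]:
    print("step %.3f -> test MSE %.4f" % (step, evaluate(train_data, test_data, 10, step)))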