test_size = test_data.count()
print "Training data size: %d" % train_size
print "Test data size: %d" % test_size
print "Total data size: %d " % num_data
print "Train + Test size : %d" % (train_size + test_size)
We can confirm that we now have two distinct datasets whose sizes add up to the size of the original dataset:
Training data size: 13934
Test data size: 3445
Total data size: 17379
Train + Test size : 17379
The final step is to apply the same approach to the features extracted for the decision tree
model:
data_with_idx_dt = data_dt.zipWithIndex().map(lambda (k, v): (v, k))
test_dt = data_with_idx_dt.sample(False, 0.2, 42)
train_dt = data_with_idx_dt.subtractByKey(test_dt)
train_data_dt = train_dt.map(lambda (idx, p): p)
test_data_dt = test_dt.map(lambda (idx, p): p)
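As a quick sanity check (a sketch using the RDD names above), we can confirm that the decision tree split produces the same sizes we saw for the linear model datasets:

# Sketch: the decision tree split uses the same sampling seed, so the
# sizes should match the earlier output (13934 and 3445)
print "DT training data size: %d" % train_data_dt.count()
print "DT test data size: %d" % test_data_dt.count()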
The impact of parameter settings for linear models
Now that we have prepared our training and test sets, we are ready to investigate the impact of different parameter settings on model performance. We will first carry out this evaluation for the linear model. We will create a convenience function to evaluate the relevant performance metric by training the model on the training set and evaluating it on the test set for different parameter settings.
We will use the RMSLE evaluation metric, as it is the one used in the Kaggle competition for this dataset; this allows us to compare our model's results against the competition leaderboard to see how we perform.
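For reference, here is a minimal sketch of the per-instance squared log error that underlies RMSLE, assuming NumPy is available as np (a similar squared_log_error helper is defined earlier in the chapter):

import numpy as np

def squared_log_error(pred, actual):
    # squared difference of the log-transformed values; the +1 offset
    # guards against taking the log of zero
    return (np.log(pred + 1) - np.log(actual + 1)) ** 2

The RMSLE is then the square root of the mean of these squared log errors over all test instances.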
The evaluation function is defined here:
def evaluate(train, test, iterations, step, regParam, regType, intercept):
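    # A sketch of the function body (an assumption based on the surrounding
    # text; the original body is not shown here): train a linear model with
    # the given parameters and return the RMSLE on the test set.
    # LinearRegressionWithSGD, np (NumPy), and the squared_log_error helper
    # are assumed to be in scope from earlier in the chapter.
    model = LinearRegressionWithSGD.train(train, iterations, step,
                                          regParam=regParam, regType=regType,
                                          intercept=intercept)
    tp = test.map(lambda p: (p.label, model.predict(p.features)))
    rmsle = np.sqrt(tp.map(lambda (t, p): squared_log_error(t, p)).mean())
    return rmsle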