Model training and testing loop
Once we have our training data in a form that is suitable for our model, we can proceed with the model's training and testing phase. During this phase, we are primarily concerned with model selection. This can refer to choosing the best modeling approach for our task, or the best parameter settings for a given model. In fact, the term model selection often covers both: in many cases, we might wish to try out a number of models and select the one that performs best (with the best-performing parameter settings for each model). It is also common in this phase to explore combinations of different models (known as ensemble methods).
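As a brief illustration, MLlib ships with tree-based ensemble methods out of the box. The sketch below is a minimal example and assumes a DataFrame with a "features" vector column and a "label" column; the column names and tree count are illustrative, not from the text:

```python
# Minimal sketch of a built-in ensemble method in Spark MLlib.
# Column names and numTrees are illustrative assumptions.
from pyspark.ml.classification import RandomForestClassifier

# A random forest combines many decision trees, each trained on random
# subsets of the data and features, and aggregates their predictions.
rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=50)
# rf.fit(train_df) would return a fitted RandomForestClassificationModel.
```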
This is typically a fairly straightforward process: we run our chosen model on our training dataset and test its performance on a test dataset, that is, a set of data held out for evaluating the model that the model has not seen in the training phase. Evaluating on held-out data in this way is referred to as cross-validation (in its simplest form, a single train-test split; more generally, k-fold cross-validation repeats the evaluation across k different splits of the data).
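As a concrete sketch of this loop, the following uses Spark's DataFrame-based MLlib API with a logistic regression model; the input path, column names, and split ratios are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("train-test-loop").getOrCreate()

# Hypothetical input: a feature-engineered dataset with a "features"
# vector column and a binary "label" column.
data = spark.read.parquet("training_data.parquet")

# Hold out 20% of the data; the model never sees it during training.
train, test = data.randomSplit([0.8, 0.2], seed=42)

# Fit the model on the training split only.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)

# Evaluate on the held-out test split.
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
print("Test AUC:", evaluator.evaluate(model.transform(test)))
```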
However, due to the large scale of data we are typically working with, it is often useful to
carry out this initial train-test loop on a smaller representative sample of our full dataset or
perform model selection using parallel methods where possible.
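Continuing the sketch above, drawing a smaller representative sample for the initial loop is a one-liner in Spark; the 10% fraction is an illustrative assumption:

```python
# Run the initial train-test loop on a 10% sample of the full dataset,
# then retrain the chosen model on the full data once it is selected.
sample = data.sample(withReplacement=False, fraction=0.1, seed=42)
train_s, test_s = sample.randomSplit([0.8, 0.2], seed=42)
```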
For this part of the pipeline, Spark's built-in machine learning library, MLlib, is a perfect fit. We will focus most of our attention here on the model training, evaluation, and cross-validation steps for various machine learning techniques, using MLlib and Spark's core features.
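For example, MLlib's CrossValidator combines k-fold cross-validation with a search over a grid of parameter settings, and can fit candidate models in parallel; the grid values, fold count, and parallelism level below are illustrative assumptions:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")

# Candidate parameter settings to select among (values are illustrative).
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3,      # 3-fold cross-validation
                    parallelism=4)   # fit up to 4 models concurrently

cv_model = cv.fit(train)      # best settings are refit on the full training data
print(cv_model.avgMetrics)    # mean metric for each parameter combination
```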