To achieve the decomposition, we call computeSVD on the RowMatrix class, as shown in Example 11-14.
Example 11-14. SVD in Scala
// Compute the top 20 singular values of a RowMatrix mat and their singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] =
  mat.computeSVD(20, computeU = true)
val U: RowMatrix = svd.U  // U is a distributed RowMatrix.
val s: Vector = svd.s     // Singular values are a local dense vector.
val V: Matrix = svd.V     // V is a local dense matrix.
Model Evaluation
No matter what algorithm is used for a machine learning task, model evaluation is an
important part of end-to-end machine learning pipelines. Many learning tasks can be
tackled with different models, and even for the same algorithm, different parameter
settings can lead to different results. Moreover, there is always a risk of overfitting a
model to the training data; the best way to detect overfitting is to test the model on a
dataset other than the training data.
At the time of writing (for Spark 1.2), MLlib contains an experimental set of model
evaluation functions, though only in Java and Scala. These are available in the
mllib.evaluation package, in classes such as BinaryClassificationMetrics and
MulticlassMetrics , depending on the problem. For these classes, you can create a
Metrics object from an RDD of (prediction, ground truth) pairs, and then compute
metrics such as precision, recall, and area under the receiver operating characteristic
(ROC) curve. These methods should be run on a test dataset not used for training
(e.g., by leaving out 20% of the data before training). You can apply your model to
the test dataset in a map() function to build the RDD of (prediction, ground truth)
pairs.
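As a rough plain-Scala sketch of what a class like BinaryClassificationMetrics computes (the pairs here are made-up toy values, and real MLlib code would operate on a distributed RDD rather than a local Seq), precision and recall reduce to simple counts over (prediction, ground truth) pairs:

```scala
object MetricsSketch {
  // (prediction, groundTruth) pairs, as you would produce by applying
  // model.predict in a map() over the held-out test dataset.
  val pairs: Seq[(Double, Double)] = Seq(
    (1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)
  )

  val truePositives  = pairs.count { case (p, t) => p == 1.0 && t == 1.0 }
  val falsePositives = pairs.count { case (p, t) => p == 1.0 && t == 0.0 }
  val falseNegatives = pairs.count { case (p, t) => p == 0.0 && t == 1.0 }

  // Precision: fraction of positive predictions that were actually positive.
  val precision = truePositives.toDouble / (truePositives + falsePositives)
  // Recall: fraction of actual positives that the model found.
  val recall = truePositives.toDouble / (truePositives + falseNegatives)

  def main(args: Array[String]): Unit =
    println(f"precision = $precision%.2f, recall = $recall%.2f")
}
```

The area under the ROC curve extends this idea by sweeping the classification threshold over the model's raw scores and tracing true-positive rate against false-positive rate.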
In future versions of Spark, the pipeline API at the end of this chapter is expected to
include evaluation functions in all languages. With the pipeline API, you can define a
pipeline of ML algorithms and an evaluation metric, and automatically have the
system search for parameters and pick the best model using cross-validation.
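The parameter search the pipeline API automates can be sketched in plain Scala (the threshold "model", toy dataset, and accuracy scorer below are illustrative stand-ins, not MLlib APIs; a real cross-validation run would retrain the model on the other folds before scoring each held-out fold):

```scala
object CrossValSketch {
  // Toy dataset of (feature, label) pairs; the "model" is a single
  // threshold on the feature, so the threshold is our one hyperparameter.
  val data: Vector[(Double, Double)] = Vector(
    (0.1, 0.0), (0.4, 0.0), (0.35, 0.0), (0.8, 1.0), (0.9, 1.0), (0.6, 1.0)
  )

  // Accuracy of a threshold "model" on one held-out fold.
  def score(threshold: Double, fold: Seq[(Double, Double)]): Double =
    fold.count { case (x, y) =>
      (if (x >= threshold) 1.0 else 0.0) == y
    }.toDouble / fold.size

  // k-fold cross-validation: average held-out accuracy across the folds.
  // (Our toy "model" has no trained parameters, so the training step is a no-op.)
  def cvScore(threshold: Double, k: Int): Double = {
    val folds = data.zipWithIndex.groupBy(_._2 % k).values.map(_.map(_._1)).toSeq
    folds.map(score(threshold, _)).sum / folds.size
  }

  // Grid-search the candidate settings and keep the best by CV score.
  val candidates = Seq(0.2, 0.5, 0.7)
  val best = candidates.maxBy(cvScore(_, k = 3))

  def main(args: Array[String]): Unit =
    println(s"best threshold = $best")
}
```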
Tips and Performance Considerations
Preparing Features
While machine learning presentations often put significant emphasis on the
algorithms used, it is important to remember that in practice, each algorithm is only
as good as the features you put into it! Many large-scale learning practitioners agree that