To achieve the decomposition, we call computeSVD on the RowMatrix class, as shown in Example 11-14.
Example 11-14. SVD in Scala
// Compute the top 20 singular values of a RowMatrix mat and their singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] =
  mat.computeSVD(20, computeU = true)
val U: RowMatrix = svd.U  // U is a distributed RowMatrix.
val s: Vector = svd.s     // Singular values are a local dense vector.
val V: Matrix = svd.V     // V is a local dense matrix.
Model Evaluation
No matter what algorithm is used for a machine learning task, model evaluation is an
important part of end-to-end machine learning pipelines. Many learning tasks can be
tackled with different models, and even for the same algorithm, different parameter
settings can lead to different results. Moreover, there is always a risk of overfitting a
model to the training data; the best way to detect overfitting is to test the model on a
dataset other than the training data.
At the time of writing (for Spark 1.2), MLlib contains an experimental set of model
evaluation functions, though only in Java and Scala. These are available in the
mllib.evaluation package, in classes such as BinaryClassificationMetrics and
MulticlassMetrics , depending on the problem. For these classes, you can create a
Metrics object from an RDD of (prediction, ground truth) pairs, and then compute
metrics such as precision, recall, and area under the receiver operating characteristic
(ROC) curve. These methods should be run on a test dataset not used for training
(e.g., by leaving out 20% of the data before training). You can apply your model to
the test dataset in a map() function to build the RDD of (prediction, ground truth)
pairs.
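As a rough plain-Scala sketch of what a class like BinaryClassificationMetrics computes (the pairs here are made-up toy values, and real MLlib code would operate on a distributed RDD rather than a local Seq), precision and recall reduce to simple counts over (prediction, ground truth) pairs:

```scala
object MetricsSketch {
  // (prediction, groundTruth) pairs, as you would produce by applying
  // model.predict in a map() over the held-out test dataset.
  val pairs: Seq[(Double, Double)] = Seq(
    (1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)
  )

  val truePositives  = pairs.count { case (p, t) => p == 1.0 && t == 1.0 }
  val falsePositives = pairs.count { case (p, t) => p == 1.0 && t == 0.0 }
  val falseNegatives = pairs.count { case (p, t) => p == 0.0 && t == 1.0 }

  // Precision: fraction of positive predictions that were actually positive.
  val precision = truePositives.toDouble / (truePositives + falsePositives)
  // Recall: fraction of actual positives that the model found.
  val recall = truePositives.toDouble / (truePositives + falseNegatives)

  def main(args: Array[String]): Unit =
    println(f"precision = $precision%.2f, recall = $recall%.2f")
}
```

The area under the ROC curve extends this idea by sweeping the classification threshold over the model's raw scores and tracing true-positive rate against false-positive rate.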
In future versions of Spark, the pipeline API at the end of this chapter is expected to
include evaluation functions in all languages. With the pipeline API, you can define a
pipeline of ML algorithms and an evaluation metric, and automatically have the
system search for parameters and pick the best model using cross-validation.
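The parameter search the pipeline API automates can be sketched in plain Scala (the threshold "model", toy dataset, and accuracy scorer below are illustrative stand-ins, not MLlib APIs; a real cross-validation run would retrain the model on the other folds before scoring each held-out fold):

```scala
object CrossValSketch {
  // Toy dataset of (feature, label) pairs; the "model" is a single
  // threshold on the feature, so the threshold is our one hyperparameter.
  val data: Vector[(Double, Double)] = Vector(
    (0.1, 0.0), (0.4, 0.0), (0.35, 0.0), (0.8, 1.0), (0.9, 1.0), (0.6, 1.0)
  )

  // Accuracy of a threshold "model" on one held-out fold.
  def score(threshold: Double, fold: Seq[(Double, Double)]): Double =
    fold.count { case (x, y) =>
      (if (x >= threshold) 1.0 else 0.0) == y
    }.toDouble / fold.size

  // k-fold cross-validation: average held-out accuracy across the folds.
  // (Our toy "model" has no trained parameters, so the training step is a no-op.)
  def cvScore(threshold: Double, k: Int): Double = {
    val folds = data.zipWithIndex.groupBy(_._2 % k).values.map(_.map(_._1)).toSeq
    folds.map(score(threshold, _)).sum / folds.size
  }

  // Grid-search the candidate settings and keep the best by CV score.
  val candidates = Seq(0.2, 0.5, 0.7)
  val best = candidates.maxBy(cvScore(_, k = 3))

  def main(args: Array[String]): Unit =
    println(s"best threshold = $best")
}
```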
Tips and Performance Considerations
Preparing Features
While machine learning presentations often put significant emphasis on the
algorithms used, it is important to remember that in practice, each algorithm is only
as good as the features you put into it! Many large-scale learning practitioners agree that