3. Call a classification algorithm (e.g., logistic regression) on the RDD of vectors;
this will give back a model object that can be used to classify new points.
4. Evaluate the model on a test dataset using one of MLlib's evaluation functions.
One important thing to note about MLlib is that it contains only parallel algorithms
that run well on clusters. Some classic ML algorithms are not included because they
were not designed for parallel platforms; in contrast, MLlib contains several recent
research algorithms designed for clusters, such as distributed random forests,
K-means||, and alternating least squares. This choice means that MLlib is best suited for
running each algorithm on a single large dataset. If you instead have many small
datasets on which you want to train different learning models, it would be better to
use a single-node learning library (e.g., Weka or scikit-learn) on each node, perhaps
calling it in parallel across nodes using a Spark map(). Likewise, it is common for
machine learning pipelines to require training the same algorithm on a small dataset
with many configurations of parameters, in order to choose the best one. You can
achieve this in Spark by using parallelize() over your list of parameter settings to
train models with different settings on different nodes, again running a single-node
learning library on each node. But MLlib
itself shines when you have a large, distributed dataset that you need to train a model
on.
Finally, in Spark 1.0 and 1.1, MLlib's interface is relatively low-level, giving you the
functions to call for different tasks but not the higher-level workflow typically
required for a learning pipeline (e.g., splitting the input into training and test data, or
trying many combinations of parameters). In Spark 1.2, MLlib gains an additional
(and at the time of writing still experimental) pipeline API for building such
pipelines. This API resembles higher-level libraries like scikit-learn, and will hopefully
make it easy to write complete, self-tuning pipelines. We include a preview of this
API at the end of this chapter, but focus primarily on the lower-level APIs.
System Requirements
MLlib requires some linear algebra libraries to be installed on your machines. First,
you will need the gfortran runtime library for your operating system. If MLlib warns
that gfortran is missing, follow the setup instructions on the MLlib website. Second,
to use MLlib in Python, you will need NumPy. If your Python installation does not
have it (i.e., you cannot import numpy), the easiest way to get it is by installing the
python-numpy or numpy package through your package manager on Linux, or by
using a third-party scientific Python distribution like Anaconda.
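A quick way to confirm the NumPy requirement before calling MLlib from Python is to attempt the import directly; the `have_numpy` flag here is just an illustrative name.

```python
# Check whether NumPy is importable, as MLlib's Python API requires it
try:
    import numpy
    have_numpy = True
except ImportError:
    have_numpy = False
```

If the import fails, install NumPy as described above before using MLlib from Python.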
MLlib's supported algorithms have also evolved over time. The ones we discuss here
are all available as of Spark 1.2, but some of the algorithms may not be present in
earlier versions.