Database Reference
In-Depth Information
CHAPTER 11
Machine Learning with MLlib
MLlib is Spark's library of machine learning functions. Designed to run in parallel on
clusters, MLlib contains a variety of learning algorithms and is accessible from all of
Spark's programming languages. This chapter will show you how to call it in your
own programs, and offer common usage tips.
Machine learning itself is a topic large enough to fill many topics, 1 so unfortunately,
in this chapter, we will not have the space to explain machine learning in detail. If
you are familiar with machine learning, however, this chapter will explain how to use
it in Spark; and even if you are new to it, you should be able to combine the material
here with other introductory material. This chapter is most relevant to data scientists
with a machine learning background looking to use Spark, as well as engineers work‐
ing with a machine learning expert.
Overview
MLlib's design and philosophy are simple: it lets you invoke various algorithms on
distributed datasets, representing all data as RDDs. MLlib introduces a few data types
(e.g., labeled points and vectors), but at the end of the day, it is simply a set of func‐
tions to call on RDDs. For example, to use MLlib for a text classification task (e.g.,
identifying spammy emails), you might do the following:
1. Start with an RDD of strings representing your messages.
2. Run one of MLlib's feature extraction algorithms to convert text into numerical
features (suitable for learning algorithms); this will give back an RDD of vectors.
1 Some examples from O'Reilly include Machine Learning with R and Machine Learning for Hackers .
 
Search WWH ::




Custom Search