Machine Learning Basics
To put the functions in MLlib in context, we'll start with a brief review of machine
learning concepts.
Machine learning algorithms attempt to make predictions or decisions based on training data, often by maximizing a mathematical objective that describes how the algorithm should behave. There are multiple types of learning problems, including classification, regression, and clustering, each of which has a different objective. As a simple example, we'll consider classification, which involves identifying which of several categories an item belongs to (e.g., whether an email is spam or non-spam), based on labeled examples of other items (e.g., emails known to be spam or not).
All learning algorithms require defining a set of features for each item, which will be fed into the learning function. For example, for an email, some features might include the server it comes from, the number of mentions of the word free, or the color of the text. In many cases, defining the right features is the most challenging part of using machine learning. For example, in a product recommendation task, simply adding another feature (e.g., realizing that which products you should recommend to a user might also depend on which movies she's watched) could give a large improvement in results.
Most algorithms are defined only for numerical features (specifically, a vector of numbers representing the value for each feature), so often an important step is feature extraction and transformation to produce these feature vectors. For example, for text classification (e.g., our spam versus non-spam case), there are several methods to featurize text, such as counting the frequency of each word.
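To illustrate the word-counting idea, here is a plain-Python sketch (not MLlib's own featurization API; the vocabulary and the `featurize` helper are invented for this example):

```python
from collections import Counter

def featurize(text, vocabulary):
    """Map a document to a feature vector of word counts,
    with one dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# A tiny, made-up vocabulary for the spam example
vocabulary = ["free", "meeting", "winner", "report"]

featurize("free offer free winner", vocabulary)  # → [2, 0, 1, 0]
```

Every document, whatever its length, is mapped to a vector of the same dimension, which is exactly what the learning algorithms downstream require.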
Once data is represented as feature vectors, most machine learning algorithms optimize a well-defined mathematical function based on these vectors. For example, one classification algorithm might define the plane (in the space of feature vectors) that “best” separates the spam and non-spam examples, according to some definition of “best” (e.g., the most points classified correctly by the plane). At the end, the algorithm returns a model representing the learned decision (e.g., the plane chosen). This model can now be used to make predictions on new points (e.g., by checking which side of the plane the feature vector for a new email falls on, in order to decide whether it's spam). Figure 11-1 shows an example learning pipeline.