Machine Learning Basics
To put the functions in MLlib in context, we'll start with a brief review of machine
learning concepts.
Machine learning algorithms attempt to make predictions or decisions based on training data, often by maximizing a mathematical objective that describes how the algorithm should behave. There are multiple types of learning problems, including classification, regression, and clustering, each of which has a different objective. As a simple example, we'll consider classification, which involves identifying which of several categories an item belongs to (e.g., whether an email is spam or non-spam), based on labeled examples of other items (e.g., emails known to be spam or not).
All learning algorithms require defining a set of features for each item, which will be fed into the learning function. For example, for an email, some features might include the server it comes from, the number of mentions of the word free, or the color of the text. In many cases, defining the right features is the most challenging part of using machine learning. For example, in a product recommendation task, simply adding another feature (e.g., realizing that which products you should recommend to a user might also depend on which movies she's watched) could give a large improvement in results.
Most algorithms are defined only for numerical features (specifically, a vector of numbers representing the value for each feature), so often an important step is feature extraction and transformation to produce these feature vectors. For example, for text classification (e.g., our spam versus non-spam case), there are several methods to featurize text, such as counting the frequency of each word.
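To illustrate the word-counting idea, here is a plain-Python sketch (not MLlib's own featurization API; the vocabulary and the `featurize` helper are invented for this example):

```python
from collections import Counter

def featurize(text, vocabulary):
    """Map a document to a feature vector of word counts,
    with one dimension per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# A tiny, made-up vocabulary for the spam example
vocabulary = ["free", "meeting", "winner", "report"]

featurize("free offer free winner", vocabulary)  # → [2, 0, 1, 0]
```

Every document, whatever its length, is mapped to a vector of the same dimension, which is exactly what the learning algorithms downstream require.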
Once data is represented as feature vectors, most machine learning algorithms optimize a well-defined mathematical function based on these vectors. For example, one classification algorithm might define the plane (in the space of feature vectors) that “best” separates the spam and non-spam examples, according to some definition of “best” (e.g., the most points classified correctly by the plane). At the end, the algorithm returns a model representing the learned decision (e.g., the plane chosen). This model can now be used to make predictions on new points (e.g., by checking which side of the plane the feature vector for a new email falls on, in order to decide whether it's spam). Figure 11-1 shows an example learning pipeline.