12 Large-Scale Machine Learning
Many algorithms are today classified as "machine learning." These algorithms share, with the other algorithms studied in this book, the goal of extracting information from data. All algorithms for analysis of data are designed to produce a useful summary of the data, from which decisions are made. Among many examples, the frequent-itemset analysis that we did in Chapter 6 produces information like association rules, which can then be used for planning a sales strategy or for many other purposes.
However, algorithms called "machine learning" not only summarize our data; they are perceived as learning a model or classifier from the data, and thus discovering something about data that will be seen in the future. For instance, the clustering algorithms discussed in Chapter 7 produce clusters that not only tell us something about the data being analyzed (the training set), but also allow us to classify future data into one of the clusters that result from the clustering algorithm. Thus, machine-learning enthusiasts often speak of clustering with the neologism "unsupervised learning"; the term unsupervised refers to the fact that the input data does not tell the clustering algorithm what the clusters should be. In supervised machine learning, which is the subject of this chapter, the available data includes information about the correct way to classify at least some of the data. The data already classified is called the training set.
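As a concrete illustration of this distinction (our own example, not from the text), a training set pairs each data item with its known class, while in the unsupervised setting the same data arrives without the labels:

```python
# Hypothetical training set for supervised learning: each example
# pairs a feature vector with the correct class label.  All names
# and values are illustrative only.
training_set = [
    ((5.1, 3.5), "class-A"),   # (features, known label)
    ((6.7, 3.1), "class-B"),
    ((4.9, 3.0), "class-A"),
    ((6.3, 2.9), "class-B"),
]

# In unsupervised learning (e.g., the clustering of Chapter 7),
# the algorithm would see only the feature vectors:
unlabeled_data = [features for features, _ in training_set]
```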
In this chapter, we do not attempt to cover all the different approaches to machine learning. We concentrate on methods that are suitable for very large data and that have the potential for parallel implementation. We consider the classical "perceptron" approach to learning a data classifier, where a hyperplane that separates two classes is sought. Then, we look at more modern techniques involving support-vector machines. Similar to perceptrons, these methods look for hyperplanes that best divide the classes, so that few, if any, members of the training set lie close to the hyperplane. We end with a discussion of nearest-neighbor techniques, where data items are classified according to the class(es) of their nearest neighbors in some space.
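To make two of these ideas concrete, here is a minimal sketch in Python (our own illustration under simplifying assumptions, not the book's code): a perceptron that learns a separating hyperplane through the origin for labels +1 and -1, and a nearest-neighbor classifier that labels a query point by majority vote among its k closest training examples.

```python
from collections import Counter

def perceptron(points, labels, eta=0.1, max_rounds=1000):
    """Learn weights w so that sign(w . x) matches each label y in {+1, -1}.
    Simplifying assumption: the hyperplane passes through the origin
    (no bias term)."""
    w = [0.0] * len(points[0])
    for _ in range(max_rounds):
        converged = True
        for x, y in zip(points, labels):
            # A point is misclassified when y and w . x disagree in sign.
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                # Nudge the hyperplane toward classifying x correctly.
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                converged = False
        if converged:  # one full pass with no mistakes: done
            break
    return w

def knn_classify(query, examples, k=3):
    """Label query by majority vote among its k nearest training examples.
    Each example is a (feature-vector, label) pair."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(examples, key=lambda ex: dist2(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical, linearly separable data.
points = [(1.0, 2.0), (2.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
labels = [+1, +1, -1, -1]
print(perceptron(points, labels))                            # e.g. [0.1, 0.2]
print(knn_classify((1.5, 2.5), list(zip(points, labels))))   # +1
```

The nudging update and the majority vote are the essential ideas; the versions developed in this chapter add the refinements, such as bias terms and margins, needed to handle very large data well.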