Chapter 8. Dimensionality Reduction with Spark
Over the course of this chapter, we will continue our exploration of unsupervised learning models in the form of dimensionality reduction.

Unlike the models we have covered so far, such as regression, classification, and clustering, dimensionality reduction does not focus on making predictions. Instead, it tries to take a set of input data with a feature dimension D (that is, the length of our feature vector) and extract a representation of the data of dimension k, where k is usually significantly smaller than D. It is, therefore, a form of preprocessing or feature transformation rather than a predictive model in its own right.
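To make this concrete, the following is a minimal sketch of such a D-to-k transformation using principal components analysis (PCA) via MLlib's RowMatrix API. The toy dataset and the choice of k are purely illustrative, and a SparkContext called sc is assumed to be available, as in the Spark shell:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A toy dataset of three rows, each with feature dimension D = 4
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0, 4.0),
  Vectors.dense(2.0, 0.0, 5.0, 1.0),
  Vectors.dense(4.0, 1.0, 0.0, 3.0)
))
val mat = new RowMatrix(rows)

// Extract a k-dimensional representation with k = 2
val k = 2
val pc = mat.computePrincipalComponents(k) // a D x k matrix of components
val projected = mat.multiply(pc)           // each row now has dimension k

Each row of projected is the k-dimensional representation of the corresponding input row; it is this compressed representation, rather than the raw D-dimensional features, that would typically be passed on to a downstream model.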
It is important that the extracted representation is still able to capture a large proportion of the variability or structure of the original data. The idea behind this is that most data sources will contain some form of underlying structure. This structure is typically unknown (often called latent features or latent factors), but if we can uncover some of it, our models could learn the structure and make predictions from it rather than from the data in its raw form, which might be noisy or contain many irrelevant features. In other words, dimensionality reduction throws away some of the noise in the data and keeps the hidden structure that is present.
In some cases, the dimensionality of the raw data is far higher than the number of data
points we have, so without dimensionality reduction, it would be difficult for other machine
learning models, such as classification and regression, to learn anything, as they need to fit
a number of parameters that is far larger than the number of training examples (in this
sense, these methods bear some similarity to the regularization approaches that we have
seen used in classification and regression).
A few use cases of dimensionality reduction techniques include:
• Exploratory data analysis
• Extracting features to train other machine learning models
• Reducing storage and computation requirements for very large models in the prediction phase (for example, a production system that makes predictions)
• Reducing a large group of text documents down to a set of hidden topics or concepts