Chapter 8. Dimensionality Reduction with Spark
Over the course of this chapter, we will continue our exploration of unsupervised learning models in the form of dimensionality reduction.

Unlike the models we have covered so far, such as regression, classification, and clustering, dimensionality reduction does not focus on making predictions. Instead, it tries to take a set of input data with a feature dimension D (that is, the length of our feature vector) and extract a representation of the data of dimension k, where k is usually significantly smaller than D. It is, therefore, a form of preprocessing or feature transformation rather than a predictive model in its own right.
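To make this concrete, the following is a minimal sketch of such a D-to-k transformation using principal components analysis (PCA) via MLlib's RowMatrix API. The toy dataset and the choice of k are purely illustrative, and a SparkContext called sc is assumed to be available, as in the Spark shell:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// A toy dataset of three rows, each with feature dimension D = 4
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0, 4.0),
  Vectors.dense(2.0, 0.0, 5.0, 1.0),
  Vectors.dense(4.0, 1.0, 0.0, 3.0)
))
val mat = new RowMatrix(rows)

// Extract a k-dimensional representation with k = 2
val k = 2
val pc = mat.computePrincipalComponents(k) // a D x k matrix of components
val projected = mat.multiply(pc)           // each row now has dimension k

Each row of projected is the k-dimensional representation of the corresponding input row; it is this compressed representation, rather than the raw D-dimensional features, that would typically be passed on to a downstream model.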
It is important that the extracted representation is still able to capture a large proportion of the variability or structure of the original data. The idea behind this is that most data sources will contain some form of underlying structure. This structure is typically unknown (often called latent features or latent factors), but if we can uncover some of it, our models could learn the structure and make predictions from it rather than from the data in its raw form, which might be noisy or contain many irrelevant features. In other words, dimensionality reduction throws away some of the noise in the data and keeps the hidden structure that is present.
In some cases, the dimensionality of the raw data is far higher than the number of data
points we have, so without dimensionality reduction, it would be difficult for other machine
learning models, such as classification and regression, to learn anything, as they need to fit
a number of parameters that is far larger than the number of training examples (in this
sense, these methods bear some similarity to the regularization approaches that we have
seen used in classification and regression).
A few use cases of dimensionality reduction techniques include:
• Exploratory data analysis
• Extracting features to train other machine learning models
• Reducing storage and computation requirements for very large models in the prediction phase (for example, a production system that makes predictions)
• Reducing a large group of text documents down to a set of hidden topics or concepts