Dimensionality Reduction with Spark - Machine Learning with Spark

Clustering as dimensionality reduction

The clustering models we covered in the previous chapter can also be used for a form of dimensionality reduction. This works in the following way:

• Assume that we cluster our high-dimensional feature vectors using a K-means clustering model with k clusters. The result is a set of k cluster centers.

• We can represent each of our original data points in terms of how far it is from each of these cluster centers. That is, we compute the distance from the data point to each cluster center, which gives a set of k distances for each data point.

• These k distances form a new vector of dimension k. Provided that k is smaller than the original feature dimension, we can now represent our original data as a new vector of lower dimension.
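The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the book's implementation: the cluster centers are hard-coded here (in practice they would come from a trained K-means model), Euclidean distance is an assumed choice of metric, and the names `euclidean` and `distance_features` are hypothetical helpers.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_features(point, centers):
    """Map a point to a k-dimensional vector of distances to the k centers."""
    return [euclidean(point, c) for c in centers]

# Assumed example: k = 2 cluster centers in a 4-dimensional feature space.
centers = [
    [0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
]

point = [3.0, 0.0, 0.0, 0.0]

# The 4-dimensional point becomes a 2-dimensional vector of distances.
reduced = distance_features(point, centers)
print(reduced)
```

Each original point, whatever its dimension, is mapped to exactly k numbers, so the size of the new representation is controlled by the number of clusters.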

Depending on the distance metric used, this can result in both dimensionality reduction and a form of nonlinear transformation of the data. A linear model trained on the new features can then capture more complex structure while still benefiting from the speed and scalability of a linear model. For example, using a Gaussian or exponential distance function can approximate a very complex nonlinear feature transformation.
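As a rough sketch of what a Gaussian distance function might look like, the snippet below maps each point to k similarity values of the form exp(-d^2 / (2 * sigma^2)), where d is the Euclidean distance to a cluster center. The exact functional form, the bandwidth parameter `sigma`, and the helper name `gaussian_features` are assumptions made for illustration; the text does not specify them.

```python
import math

def gaussian_features(point, centers, sigma=1.0):
    """Map a point to k Gaussian similarities, one per cluster center.

    sigma is an assumed bandwidth: larger values make the similarity
    decay more slowly with distance.
    """
    features = []
    for center in centers:
        squared_distance = sum((x - y) ** 2 for x, y in zip(point, center))
        features.append(math.exp(-squared_distance / (2.0 * sigma ** 2)))
    return features

# Assumed example: two cluster centers in a 3-dimensional feature space.
example_centers = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
]

example_point = [0.0, 0.0, 0.0]
transformed = gaussian_features(example_point, example_centers, sigma=1.0)
print(transformed)
```

A point that coincides with a center gets a feature value of 1 for that center, and the value decays toward 0 as the point moves away. Because this mapping is nonlinear in the original features, a linear model fitted on the transformed vectors can represent decision boundaries that are nonlinear in the original space.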
