We can view PCA as a data-mining technique. The high-dimensional data can be replaced by its projection onto the most important axes, which are the axes corresponding to the largest eigenvalues. Thus, the original data is approximated by data that has many fewer dimensions and that summarizes the original data well.
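The idea above can be sketched in a few lines of numpy: keep only the eigenvectors of M^T M with the largest eigenvalues, and represent each point by its coordinates along those axes. The data matrix here is made up for illustration; it is not the example used below.

```python
import numpy as np

# Made-up data: four points in three dimensions, one point per row.
M = np.array([[2.0, 1.0, 0.1],
              [1.0, 2.0, -0.1],
              [3.0, 3.5, 0.0],
              [4.0, 3.0, 0.2]])

# Eigendecomposition of the symmetric matrix M^T M.
# np.linalg.eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(M.T @ M)

k = 2                          # number of axes to keep
top_axes = eigvecs[:, -k:]     # columns for the k largest eigenvalues

reduced = M @ top_axes         # each point summarized by k coordinates
approx = reduced @ top_axes.T  # low-rank approximation of the original M
```

Because the discarded axes carry the smallest eigenvalues, the squared Frobenius-norm error of the approximation equals the sum of the dropped eigenvalues.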
11.2.1 An Illustrative Example
We shall start the exposition with a contrived and simple example. In this example, the data
is two-dimensional, a number of dimensions that is too small to make PCA really useful.
Moreover, the data, shown in Fig. 11.1, has only four points, and they are arranged in a
simple pattern along the 45-degree line to make our calculations easy to follow. That is,
to anticipate the result, the points can best be viewed as lying along the axis that is at a
45-degree angle, with small deviations in the perpendicular direction.
Figure 11.1 Four points in a two-dimensional space
To begin, let us represent the points by a matrix M with four rows, one for each point, and two columns, corresponding to the x-axis and y-axis. This matrix is

        | 1  2 |
    M = | 2  1 |
        | 3  4 |
        | 4  3 |
Compute M^T M, which is

    M^T M = | 30  28 |
            | 28  30 |
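This computation can be checked with numpy, assuming the four points of Fig. 11.1 are (1,2), (2,1), (3,4), and (4,3):

```python
import numpy as np

# The four example points, one per row.
M = np.array([[1, 2],
              [2, 1],
              [3, 4],
              [4, 3]])

# M^T M is a 2x2 symmetric matrix.
MtM = M.T @ M
print(MtM)
# [[30 28]
#  [28 30]]
```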
We may find the eigenvalues of the matrix above by solving the equation

    (30 − λ)(30 − λ) − 28 × 28 = 0

as we did in Example 11.2. This equation says 30 − λ = ±28, so the solutions are λ = 58 and λ = 2.
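As a quick numerical check of these eigenvalues:

```python
import numpy as np

# M^T M from the example above.
MtM = np.array([[30.0, 28.0],
                [28.0, 30.0]])

# eigvalsh returns the eigenvalues of a symmetric matrix,
# smallest first: 2 and 58.
eigvals = np.linalg.eigvalsh(MtM)
```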
Following the same procedure as in Example 11.2, we must solve

    | 30  28 | | x |      | x |
    | 28  30 | | y | = 58 | y |

When we multiply out the matrix and vector, we get the two equations

    30x + 28y = 58x
    28x + 30y = 58y

Both equations tell us the same thing: x = y.
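The eigenvector for λ = 58 can also be computed directly. A sketch with numpy (note that numpy may flip the overall sign of the eigenvector it returns):

```python
import numpy as np

MtM = np.array([[30.0, 28.0],
                [28.0, 30.0]])

# eigh returns eigenvalues in ascending order, with eigenvectors
# as the corresponding columns of the second result.
eigvals, eigvecs = np.linalg.eigh(MtM)

# The column for the largest eigenvalue (58) lies along the
# 45-degree line: both components are equal, each 1/sqrt(2).
v = eigvecs[:, -1]
```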