PCA is based on linear correlations. The concept of linear correlation and
the Pearson correlation coefficient that measures it were presented in the previous
chapter. PCA examines the correlations among the original inputs and uses this
information to construct appropriate composite measures, called principal
components.
The goal of PCA is to extract the smallest number of components that
account for as much of the information in the original fields as possible. Moreover,
a typical PCA derives uncorrelated components, a characteristic that makes them
appropriate as input to many other modeling techniques, including clustering.
The derived components are typically associated with a specific set of the original
fields. They are produced by linear transformations of the inputs, as shown by
the following equations, where $F_i$ ($i = 1, \dots, n$) denotes the $n$ input fields
used to construct the $m$ components:
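$$
\begin{aligned}
\text{Component 1} &= a_{11} F_1 + a_{12} F_2 + \cdots + a_{1n} F_n \\
\text{Component 2} &= a_{21} F_1 + a_{22} F_2 + \cdots + a_{2n} F_n \\
&\;\;\vdots \\
\text{Component } m &= a_{m1} F_1 + a_{m2} F_2 + \cdots + a_{mn} F_n
\end{aligned}
$$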
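The linear combinations above can be illustrated with a minimal sketch. It assumes NumPy, standardized input fields, and coefficients taken as the unit-length eigenvectors of the correlation matrix; actual tools may scale or rotate the coefficients differently. The function name `pca_linear_combinations` and the synthetic data are hypothetical and serve only as an illustration.

```python
import numpy as np

def pca_linear_combinations(fields):
    """Sketch of PCA on the correlation matrix.

    Returns (coefficients, scores), where coefficients[i, j] plays the role
    of a_ij (weight of field j in component i) and scores[:, i] holds the
    values of component i for every record.
    """
    fields = np.asarray(fields, dtype=float)
    # Standardize each field so that all fields are on a comparable scale.
    standardized = (fields - fields.mean(axis=0)) / fields.std(axis=0, ddof=1)
    correlation = np.corrcoef(fields, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(correlation)
    order = np.argsort(eigenvalues)[::-1]      # most important component first
    coefficients = eigenvectors[:, order].T    # one row of weights per component
    scores = standardized @ coefficients.T     # Component i = sum_j a_ij * F_j
    return coefficients, scores

# Synthetic data: two correlated fields and one unrelated field.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo_fields = np.column_stack([
    base + 0.3 * rng.normal(size=200),
    base + 0.3 * rng.normal(size=200),
    rng.normal(size=200),
])

coefficients, scores = pca_linear_combinations(demo_fields)

# Rebuild the first component term by term, exactly as in the first equation.
standardized = (demo_fields - demo_fields.mean(axis=0)) / demo_fields.std(axis=0, ddof=1)
component_1 = sum(
    coefficients[0, j] * standardized[:, j] for j in range(demo_fields.shape[1])
)
```

Here `component_1` matches the first column of `scores`, which shows that each component is nothing more than a weighted sum of the input fields.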
The coefficients are automatically calculated by the algorithm so that the loss
of information is minimal. Components are extracted in decreasing order of
importance, with the first one being the most significant as it accounts for the
largest amount of the total original information. Specifically, the first component
is the linear combination that carries as much as possible of the total variability
of the input fields. It therefore explains a larger share of their information than
any other single component. The second component accounts for the largest amount
of the remaining, unexplained variability and is also uncorrelated with the first
component. Subsequent components are constructed in the same way to account for
the remaining information, each uncorrelated with all the preceding ones.
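The ordering and the lack of correlation described above can be checked numerically. The following is a sketch under stated assumptions: NumPy, synthetic data, and components obtained from the eigen-decomposition of the correlation matrix, in which case each eigenvalue equals the variance carried by the corresponding component. All variable names are hypothetical.

```python
import numpy as np

# Synthetic data: three fields driven by a shared signal plus one unrelated field.
rng = np.random.default_rng(42)
n_records = 500
shared = rng.normal(size=n_records)
fields = np.column_stack([
    shared + 0.4 * rng.normal(size=n_records),
    shared + 0.4 * rng.normal(size=n_records),
    shared + 0.4 * rng.normal(size=n_records),
    rng.normal(size=n_records),
])

# Standardize, then decompose the correlation matrix.
standardized = (fields - fields.mean(axis=0)) / fields.std(axis=0, ddof=1)
correlation = np.corrcoef(standardized, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(correlation)

# Sort so that the component carrying the most variability comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

scores = standardized @ eigenvectors

# Share of the total variability explained by each component, in decreasing order.
explained_share = eigenvalues / eigenvalues.sum()

# Correlations between component scores: off-diagonal entries are numerically zero.
score_correlations = np.corrcoef(scores, rowvar=False)

print(np.round(explained_share, 3))
print(np.round(score_correlations, 3))
```

The shares in `explained_share` sum to 1 when all components are retained, which is the sense in which n components fully account for n input fields.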
Since n components are required to fully account for the original information
of n input fields, the question is "where do we stop, and how many components should
we extract?" Although there are specific technical criteria that can be applied to
guide analysts in the procedure, the final decision should take into account criteria
such as the interpretability and the business meaning of the components. The
final solution should balance simplicity with effectiveness, consisting of a reduced
and interpretable set of components that can adequately represent the original
fields.
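The text does not name the technical criteria it has in mind. As an illustration only, the sketch below applies two widely used rules of thumb: keeping components whose eigenvalue exceeds 1, and keeping enough components to reach a target share of the total variance. The function name `suggest_component_counts`, the 80% target, and the synthetic data are assumptions made for this example, and, as the text stresses, the resulting numbers are a starting point rather than a substitute for judging interpretability and business meaning.

```python
import numpy as np

def suggest_component_counts(fields, variance_target=0.80):
    """Apply two common rules of thumb to the correlation-matrix eigenvalues.

    The variance_target of 0.80 is an illustrative default, not a standard.
    """
    correlation = np.corrcoef(np.asarray(fields, dtype=float), rowvar=False)
    eigenvalues = np.sort(np.linalg.eigvalsh(correlation))[::-1]
    cumulative_share = np.cumsum(eigenvalues) / eigenvalues.sum()

    # Rule 1: keep components that carry more variance than one standardized field.
    by_eigenvalue = int(np.sum(eigenvalues > 1.0))

    # Rule 2: keep the smallest number of components reaching the target share.
    first_index = int(np.searchsorted(cumulative_share, variance_target))
    by_variance = min(first_index + 1, len(eigenvalues))

    return {
        "eigenvalues": eigenvalues,
        "cumulative_share": cumulative_share,
        "by_eigenvalue": by_eigenvalue,
        "by_variance": by_variance,
    }

# Synthetic data: six fields built from two independent underlying signals.
rng = np.random.default_rng(7)
n_records = 500
signal_a = rng.normal(size=n_records)
signal_b = rng.normal(size=n_records)
demo_fields = np.column_stack(
    [signal_a + 0.4 * rng.normal(size=n_records) for _ in range(3)]
    + [signal_b + 0.4 * rng.normal(size=n_records) for _ in range(3)]
)

suggestion = suggest_component_counts(demo_fields)
print(suggestion["by_eigenvalue"], suggestion["by_variance"])
```

On data like this, with two clearly separated groups of fields, both rules point to the same small number of components; on real data they often disagree, which is where the analyst's judgment comes in.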
Apart from PCA, a related statistical technique commonly used for data
reduction is factor analysis. It is closely related to PCA and tends to produce
comparable results. Factor analysis is mostly used when the main aim
of the analysis is to uncover and interpret latent data dimensions, whereas PCA is
typically the preferred option for reducing the dimensionality of the data.