Data Mining Techniques for Segmentation - Data Mining Techniques in CRM: Inside Customer Segmentation


PCA is based on linear correlations. The concept of linear correlation and

the measure of the Pearson correlation coefficient were presented in the previous

chapter. PCA examines the correlations among the original inputs and uses this

information to construct the appropriate composite measures, named principal

components.
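
As a minimal sketch of the starting point described above, the Pearson correlation matrix of the inputs can be computed as follows. The field names (`voice_calls`, `voice_minutes`, `sms_count`) and the simulated data are hypothetical and serve only as an illustration; they are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_customers = 500

# Hypothetical usage fields: the first two share a common "voice" driver,
# so they should be strongly correlated; the third is independent.
voice = rng.normal(size=n_customers)
voice_calls = voice + 0.3 * rng.normal(size=n_customers)
voice_minutes = voice + 0.3 * rng.normal(size=n_customers)
sms_count = rng.normal(size=n_customers)

X = np.column_stack([voice_calls, voice_minutes, sms_count])

# Pearson correlation matrix; columns are fields, hence rowvar=False.
R = np.corrcoef(X, rowvar=False)
print(np.round(R, 2))
```

Strongly correlated fields such as the first two here are the ones PCA would be expected to fold into a single component.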

The goal of PCA is to extract the smallest number of components that

account for as much of the information in the original fields as possible. Moreover,

a typical PCA derives uncorrelated components, a characteristic that makes them

appropriate as input to many other modeling techniques, including clustering.

The derived components are typically associated with a specific set of the original

fields. They are produced by linear transformations of the inputs, as shown by

the following equations, where $F_i$ denotes the input fields ($n$ fields) used for the

construction of the components ($m$ components):

$$
\begin{aligned}
\text{Component}_1 &= a_{11} F_1 + a_{12} F_2 + \cdots + a_{1n} F_n \\
\text{Component}_2 &= a_{21} F_1 + a_{22} F_2 + \cdots + a_{2n} F_n \\
&\;\;\vdots \\
\text{Component}_m &= a_{m1} F_1 + a_{m2} F_2 + \cdots + a_{mn} F_n
\end{aligned}
$$
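
The linear combinations above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the procedure of any particular tool: the coefficients are taken as the eigenvectors of the correlation matrix, the fields are standardized before the coefficients are applied, and the helper name `pca_coefficients` and the simulated data are hypothetical.

```python
import numpy as np

def pca_coefficients(X):
    """Return (A, eigenvalues, Z); row j of A holds the coefficients a_j1..a_jn."""
    # Assumption: fields are standardized, so the PCA is correlation-based.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(R)  # returned in ascending order
    order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
    A = eigenvectors[:, order].T
    return A, eigenvalues[order], Z

# Hypothetical data: four fields forming two correlated pairs.
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.5 * rng.normal(size=300),
    base[:, 1],
    base[:, 1] + 0.5 * rng.normal(size=300),
])

A, eigenvalues, Z = pca_coefficients(X)

# Component j = a_j1 * F_1 + a_j2 * F_2 + ... + a_jn * F_n, for every record.
components = Z @ A.T

# The derived components should be mutually uncorrelated.
component_corr = np.corrcoef(components, rowvar=False)
print(np.round(component_corr, 3))
```

Each row of `A` plays the role of one equation's coefficients, and the near-identity correlation matrix of `components` reflects the uncorrelated property of the components.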

The coefficients are automatically calculated by the algorithm so that the loss

of information is minimal. Components are extracted in decreasing order of

importance, with the first one being the most significant as it accounts for the

largest amount of the total original information. Specifically, the first component

is the linear combination that carries as much as possible of the total variability

of the input fields. Thus, it explains more of their information than any other single component. The second

component accounts for the largest amount of the unexplained variability and is also

uncorrelated with the first component. Subsequent components are constructed

to account for the remaining information.
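
The decreasing order of importance can be illustrated by looking at the share of total variability carried by each component. The sketch below assumes a correlation-based PCA, in which the eigenvalues of the correlation matrix give the variance of each component; the function name `explained_variance_ratio` and the simulated data are hypothetical.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance per component, largest first."""
    R = np.corrcoef(X, rowvar=False)
    # eigvalsh returns ascending eigenvalues; reverse to get decreasing order.
    eigenvalues = np.linalg.eigvalsh(R)[::-1]
    return eigenvalues / eigenvalues.sum()

# Hypothetical data: three fields driven by one shared factor, two independent.
rng = np.random.default_rng(2)
shared = rng.normal(size=400)
X = np.column_stack(
    [shared + 0.4 * rng.normal(size=400) for _ in range(3)]
    + [rng.normal(size=400) for _ in range(2)]
)

ratios = explained_variance_ratio(X)
cumulative = np.cumsum(ratios)

for j, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"Component {j}: {r:.1%} of variance, {c:.1%} cumulative")
```

The cumulative share reaches 100% only once all components are included, one per input field.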

Since n components are required to fully account for the original information

of n input fields, the question is "where do we stop and how many components should

we extract?" Although there are specific technical criteria that can be applied to

guide analysts in the procedure, the final decision should take into account criteria

such as the interpretability and the business meaning of the components. The

final solution should balance simplicity with effectiveness, consisting of a reduced

and interpretable set of components that can adequately represent the original

fields.
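
The text does not name its technical criteria at this point. As an assumption, the sketch below applies two widely used heuristics: keeping components with an eigenvalue above 1, and keeping enough components to reach a chosen cumulative share of variance. The function names, the 80% threshold, and the example eigenvalues are hypothetical.

```python
import numpy as np

def n_by_eigenvalue(eigenvalues, cutoff=1.0):
    """Count components whose eigenvalue exceeds the cutoff."""
    return int(np.sum(np.asarray(eigenvalues) > cutoff))

def n_by_cumulative_variance(eigenvalues, min_share=0.8):
    """Smallest number of components whose cumulative share reaches min_share."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    n = int(np.searchsorted(cumulative, min_share)) + 1
    return min(n, len(eigenvalues))

# Hypothetical eigenvalues from a correlation-based PCA of five fields.
eigenvalues = [2.9, 1.2, 0.5, 0.3, 0.1]

print(n_by_eigenvalue(eigenvalues))                 # components with eigenvalue > 1
print(n_by_cumulative_variance(eigenvalues, 0.8))   # components needed for 80%
print(n_by_cumulative_variance(eigenvalues, 0.9))   # components needed for 90%
```

Such counts are only a starting point; the suggested solution still has to be checked for interpretability and business meaning before it is adopted.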

Apart from PCA, a related statistical technique commonly used for data

reduction is factor analysis. It is closely related to PCA and tends to produce

comparable results. Factor analysis is mostly used when the main aim

of the analysis is to uncover and interpret latent data dimensions, whereas PCA is

typically the preferred option for reducing the dimensionality of the data.
