In the following sections we will focus on PCA. We will examine and explain
the PCA results and present guidelines for setting up, understanding, and using
this modeling technique. Key issues that a data miner has to face in PCA include:
• How many components are to be extracted?
• Is the derived solution efficient and useful?
• Which original fields are most strongly related to each component?
• What does each component represent? In other words, what is the meaning of
each component?
The next sections will try to clarify these issues.
PCA DATA CONSIDERATIONS
PCA, as an unsupervised technique, expects only inputs. Specifically, it is
appropriate for the analysis of numeric continuous fields. Categorical data are
not suitable for this type of analysis.
Moreover, it is assumed that there are linear correlations among at least some
of the original fields. Data reduction makes sense only when the inputs are
associated; otherwise the respective benefits are trivial.
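Unlike clustering techniques, PCA is not affected by potential differences in
the measurement scale of the inputs, provided that the components are extracted
from the correlation matrix rather than the covariance matrix. In that case there
is no need to compensate for fields measured in larger values than others.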
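The point about measurement scale can be checked with a minimal sketch. It assumes NumPy and a correlation-based PCA; the three usage fields are synthetic and hypothetical, not the fields of the example that follows. Rescaling one input leaves the eigenvalues of the correlation matrix unchanged, whereas the eigenvalues of the covariance matrix do change.

```python
import numpy as np

def correlation_pca(X):
    """Eigenvalues (descending) and eigenvectors of the correlation matrix of X."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
# Synthetic, hypothetical inputs: two correlated fields and one unrelated field.
minutes = rng.normal(100, 30, size=500)
calls = 0.5 * minutes + rng.normal(0, 5, size=500)
sms = rng.normal(20, 10, size=500)
X = np.column_stack([minutes, calls, sms])

# Re-express the first field in seconds: a 60-fold change of scale.
X_rescaled = X.copy()
X_rescaled[:, 0] *= 60

eig_original, _ = correlation_pca(X)
eig_rescaled, _ = correlation_pca(X_rescaled)
scale_invariant = np.allclose(eig_original, eig_rescaled)

# For contrast, covariance-based eigenvalues depend on the measurement scale.
cov_original = np.linalg.eigvalsh(np.cov(X, rowvar=False))
cov_rescaled = np.linalg.eigvalsh(np.cov(X_rescaled, rowvar=False))
covariance_changes = not np.allclose(cov_original, cov_rescaled)

print("correlation-based PCA unchanged:", scale_invariant)
print("covariance-based PCA changed:", covariance_changes)
```

If a tool extracts components from the covariance matrix instead, the inputs would have to be standardized first to obtain the same behavior.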
PCA scores new records by deriving new fields representing the component
scores, but it will not score incomplete records (records with null or missing values
in any of the input fields).
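The scoring behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular tool: it assumes NumPy, a correlation-based solution, and the hypothetical helper functions fit_pca and score. Leaving the scores of incomplete records as NaN is our own choice for representing "not scored".

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit a correlation-based PCA; return the training mean, std and loadings."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, std, eigvecs[:, order]

def score(records, mean, std, components):
    """Derive component scores; records with any missing input stay unscored (NaN)."""
    records = np.asarray(records, dtype=float)
    complete = ~np.isnan(records).any(axis=1)
    scores = np.full((records.shape[0], components.shape[1]), np.nan)
    # Standardize with the training statistics, then project onto the components.
    scores[complete] = ((records[complete] - mean) / std) @ components
    return scores

rng = np.random.default_rng(1)
# Synthetic training data: three inputs driven by one common underlying factor.
base = rng.normal(size=(200, 1))
train = np.hstack([base + rng.normal(scale=0.3, size=(200, 1)) for _ in range(3)])
mean, std, components = fit_pca(train, n_components=1)

new_records = np.array([
    [0.5, 0.4, 0.6],      # complete record: receives a score
    [1.0, np.nan, 0.9],   # incomplete record: not scored
])
new_scores = score(new_records, mean, std, components)
print(new_scores)
```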
HOW MANY COMPONENTS ARE TO BE EXTRACTED?
In the next section we will present PCA by examining the results of a simple
example referring to the case of a mobile telephony operator that wants to analyze
customer behaviors and reveal the true data dimensions which underlie the usage
fields given in Table 3.1. (Hereafter, for readability in all tables and graphs of
results, the field names will be presented without underscores.)
Table 3.2 lists the pairwise Pearson correlation coefficients among the above
inputs. As shown in the table, there are some significant correlations among specific
usage fields. Statistically significant correlations (at a 0.01 level) are marked by an
asterisk.
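A correlation table of this kind can be approximated with a short sketch. It assumes NumPy, synthetic data, and hypothetical field names (they are not the fields of Table 3.1). The significance test uses the Fisher z-transformation with a normal approximation, which is our own simplification; statistical packages typically report a t-based test.

```python
import math
from statistics import NormalDist

import numpy as np

def correlation_table(X, names, alpha=0.01):
    """Pairwise Pearson correlations with an approximate two-sided significance flag."""
    n = X.shape[0]
    R = np.corrcoef(X, rowvar=False)
    rows = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = float(R[i, j])
            # Keep r strictly inside (-1, 1) so that atanh is defined.
            r_safe = max(min(r, 0.999999), -0.999999)
            # Fisher z: atanh(r) * sqrt(n - 3) is roughly standard normal when rho = 0.
            z = math.atanh(r_safe) * math.sqrt(n - 3)
            p_value = 2 * (1 - NormalDist().cdf(abs(z)))
            rows.append((names[i], names[j], r, p_value < alpha))
    return rows

rng = np.random.default_rng(2)
# Synthetic, hypothetical usage fields.
minutes = rng.normal(100, 30, size=500)
calls = 0.5 * minutes + rng.normal(0, 5, size=500)
sms = rng.normal(20, 10, size=500)
X = np.column_stack([minutes, calls, sms])
names = ["VOICE MINUTES", "VOICE CALLS", "SMS COUNT"]

table = correlation_table(X, names)
for field_a, field_b, r, significant in table:
    # Significant correlations (at the 0.01 level) are marked by an asterisk.
    print(f"{field_a} / {field_b}: {r:.3f}{'*' if significant else ''}")
```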