Figure 3.5 Principal components analysis. Y1 and Y2 are the first two principal components for the given data.
providing important information about variance. That is, the sorted axes are such
that the first axis shows the most variance among the data, the second axis shows the
next highest variance, and so on. For example, Figure 3.5 shows the first two principal
components, Y1 and Y2, for the given set of data originally mapped to the axes X1
and X2. This information helps identify groups or patterns within the data.
4. Because the components are sorted in decreasing order of “significance,” the data size
can be reduced by eliminating the weaker components, that is, those with low vari-
ance. Using the strongest principal components, it should be possible to reconstruct
a good approximation of the original data.
PCA can be applied to ordered and unordered attributes, and can handle sparse data
and skewed data. Multidimensional data of more than two dimensions can be han-
dled by reducing the problem to two dimensions. Principal components may be used
as inputs to multiple regression and cluster analysis. In comparison with wavelet trans-
forms, PCA tends to be better at handling sparse data, whereas wavelet transforms are
more suitable for data of high dimensionality.
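The dimensionality reduction described in steps 3 and 4 can be sketched in a few lines of code. The following is a minimal illustration (not from the text) in Python with NumPy, assuming the principal axes are obtained from an eigen-decomposition of the covariance matrix: the components are sorted in decreasing order of variance, the strongest k are kept, and an approximation of the original data is reconstructed from them.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the k principal components with highest variance."""
    # Center the data so each attribute has zero mean.
    mean = X.mean(axis=0)
    Xc = X - mean
    # Eigen-decomposition of the covariance matrix gives the principal axes.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    # Sort the components in decreasing order of "significance" (explained variance)
    # and keep only the k strongest ones.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # Reduced data: coordinates of each tuple along the strongest components.
    Z = Xc @ components
    # Approximate reconstruction of the original data from the reduced representation.
    X_approx = Z @ components.T + mean
    return Z, X_approx

# Example: 200 tuples with 5 attributes, reduced to the 2 strongest components.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, X_approx = pca_reduce(X, k=2)
print(Z.shape, np.mean((X - X_approx) ** 2))  # (200, 2) and the reconstruction error
```

The smaller the variance along the discarded axes, the smaller the reconstruction error, which is why dropping the weakest components loses little information.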
3.4.4 Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be irrel-
evant to the mining task or redundant. For example, if the task is to classify customers
based on whether or not they are likely to purchase a popular new CD at AllElectronics
when notified of a sale, attributes such as the customer's telephone number are likely to
be irrelevant, unlike attributes such as age or music taste. Although it may be possible for
a domain expert to pick out some of the useful attributes, this can be a difficult and time-
consuming task, especially when the data's behavior is not well known. (Hence, a reason
behind its analysis!) Leaving out relevant attributes or keeping irrelevant attributes may
be detrimental, causing confusion for the mining algorithm employed. This can result
in discovered patterns of poor quality. In addition, the added volume of irrelevant or
redundant attributes can slow down the mining process.
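One simple way to make this idea concrete is a greedy stepwise forward selection, sketched below in Python. This is only an illustration under assumptions not stated in the text: the attribute names and the caller-supplied score function (for example, cross-validated accuracy of a classifier trained on only those attributes) are hypothetical.

```python
from typing import Callable, List, Set

def forward_select(all_attrs: List[str],
                   score: Callable[[Set[str]], float],
                   max_attrs: int) -> Set[str]:
    """Greedily add, one at a time, the attribute that most improves the score."""
    selected: Set[str] = set()
    best_score = float("-inf")
    while len(selected) < max_attrs:
        # Evaluate each remaining attribute when added to the current subset.
        candidates = [a for a in all_attrs if a not in selected]
        if not candidates:
            break
        top_score, top_attr = max((score(selected | {a}), a) for a in candidates)
        if top_score <= best_score:
            break  # no remaining attribute improves the score; stop early
        selected.add(top_attr)
        best_score = top_score
    return selected

# Usage (hypothetical attributes and scoring function):
# best = forward_select(["age", "music_taste", "phone_number"], my_score, max_attrs=2)
```

Irrelevant attributes such as a telephone number would rarely improve the score and so would simply never be selected.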