Figure 3.5 Principal components analysis. Y1 and Y2 are the first two principal components for the given data.
providing important information about variance. That is, the sorted axes are such
that the first axis shows the most variance among the data, the second axis shows the
next highest variance, and so on. For example, Figure 3.5 shows the first two principal
components, Y1 and Y2, for the given set of data originally mapped to the axes X1
and X2. This information helps identify groups or patterns within the data.
4. Because the components are sorted in decreasing order of “significance,” the data size
can be reduced by eliminating the weaker components, that is, those with low vari-
ance. Using the strongest principal components, it should be possible to reconstruct
a good approximation of the original data.
PCA can be applied to ordered and unordered attributes, and can handle sparse data
and skewed data. Multidimensional data of more than two dimensions can be han-
dled by reducing the problem to two dimensions. Principal components may be used
as inputs to multiple regression and cluster analysis. In comparison with wavelet trans-
forms, PCA tends to be better at handling sparse data, whereas wavelet transforms are
more suitable for data of high dimensionality.
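The dimensionality reduction described in steps 3 and 4 can be sketched in a few lines of code. The following is a minimal illustration (not from the text) in Python with NumPy, assuming the principal axes are obtained from an eigen-decomposition of the covariance matrix: the components are sorted in decreasing order of variance, the strongest k are kept, and an approximation of the original data is reconstructed from them.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the k principal components with highest variance."""
    # Center the data so each attribute has zero mean.
    mean = X.mean(axis=0)
    Xc = X - mean
    # Eigen-decomposition of the covariance matrix gives the principal axes.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    # Sort the components in decreasing order of "significance" (explained variance)
    # and keep only the k strongest ones.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # Reduced data: coordinates of each tuple along the strongest components.
    Z = Xc @ components
    # Approximate reconstruction of the original data from the reduced representation.
    X_approx = Z @ components.T + mean
    return Z, X_approx

# Example: 200 tuples with 5 attributes, reduced to the 2 strongest components.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, X_approx = pca_reduce(X, k=2)
print(Z.shape, np.mean((X - X_approx) ** 2))  # (200, 2) and the reconstruction error
```

The smaller the variance along the discarded axes, the smaller the reconstruction error, which is why dropping the weakest components loses little information.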
3.4.4 Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of which may be irrel-
evant to the mining task or redundant. For example, if the task is to classify customers
based on whether or not they are likely to purchase a popular new CD at AllElectronics
when notified of a sale, attributes such as the customer's telephone number are likely to
be irrelevant, unlike attributes such as age or music taste. Although it may be possible for
a domain expert to pick out some of the useful attributes, this can be a difficult and time-
consuming task, especially when the data's behavior is not well known. (Hence, a reason
behind its analysis!) Leaving out relevant attributes or keeping irrelevant attributes may
be detrimental, causing confusion for the mining algorithm employed. This can result
in discovered patterns of poor quality. In addition, the added volume of irrelevant or
redundant attributes can slow down the mining process.
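One simple way to make this idea concrete is a greedy stepwise forward selection, sketched below in Python. This is only an illustration under assumptions not stated in the text: the attribute names and the caller-supplied score function (for example, cross-validated accuracy of a classifier trained on only those attributes) are hypothetical.

```python
from typing import Callable, List, Set

def forward_select(all_attrs: List[str],
                   score: Callable[[Set[str]], float],
                   max_attrs: int) -> Set[str]:
    """Greedily add, one at a time, the attribute that most improves the score."""
    selected: Set[str] = set()
    best_score = float("-inf")
    while len(selected) < max_attrs:
        # Evaluate each remaining attribute when added to the current subset.
        candidates = [a for a in all_attrs if a not in selected]
        if not candidates:
            break
        top_score, top_attr = max((score(selected | {a}), a) for a in candidates)
        if top_score <= best_score:
            break  # no remaining attribute improves the score; stop early
        selected.add(top_attr)
        best_score = top_score
    return selected

# Usage (hypothetical attributes and scoring function):
# best = forward_select(["age", "music_taste", "phone_number"], my_score, max_attrs=2)
```

Irrelevant attributes such as a telephone number would rarely improve the score and so would simply never be selected.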