Data Preparation Basic Models - Data Preprocessing in Data Mining

In order to avoid the aforementioned problems, a very typical transformation used

for DM methods is to map each nominal attribute to a set of newly generated attributes.

If N is the number of different values the nominal attribute has, we will substitute

the nominal variable with a new set of binary attributes, each one representing one of

the N possible values. For each instance, only one of the N newly created attributes

will have a value of 1, while the rest will have the value of 0. The variable having

the value 1 is the variable related to the original value that the old nominal attribute

had. This transformation is also referred to in the literature as the 1-to-N transformation.

As [30] and [28] state, the new set of attributes is linearly dependent. That means

that one of the attributes can be dismissed without loss of information as we can infer

the value of one of the new attributes by knowing the values of the rest of them. A

problem with this kind of transformation appears when the original nominal attribute

has a large cardinality. In this case, the number of attributes generated will be large as

well, resulting in a very sparse data set which will lead to numerical and performance

problems.
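
The 1-to-N transformation and the linear dependence noted above can be illustrated with a minimal sketch. The function name `one_to_n`, the `drop_last` flag and the sample `colors` column are hypothetical and chosen only for this illustration; they do not come from the text.

```python
def one_to_n(values, drop_last=False):
    """Map a nominal column to binary indicator columns (1-to-N encoding)."""
    # Distinct values in order of first appearance; each becomes one new attribute.
    categories = list(dict.fromkeys(values))
    # The N indicators of an instance always sum to 1, so one of them is redundant
    # and can optionally be dropped without losing information.
    kept = categories[:-1] if drop_last else categories
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


colors = ["red", "green", "blue", "green"]  # hypothetical nominal attribute

names, encoded = one_to_n(colors)
# Each row of `encoded` contains exactly one 1, placed in the column of its value.

names_reduced, reduced = one_to_n(colors, drop_last=True)
# The dropped indicator ("blue") is recoverable from the remaining ones.
recovered_blue = [1 - sum(row) for row in reduced]
```

Note that the number of generated columns equals the number of distinct values, which is why a high-cardinality attribute produces a wide, sparse result.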

3.5.9 Transformations via Data Reduction

In the previous sections, we have analyzed the processes to transform or create new

attributes from the existing ones. However, when the data set is very large, performing

complex analysis and DM can take a long computing time. Data reduction techniques

are applied in these domains to reduce the size of the data set while trying to maintain

the integrity and the information of the original data set as much as possible. In this

way, mining on the reduced data set will be much more efficient and it will also

resemble the results that would have been obtained using the original data set.

The main strategies to perform data reduction are Dimensionality Reduction (DR)

techniques. They aim to reduce the number of attributes or instances available in

the data set. Well-known attribute reduction techniques are wavelet transforms and

Principal Component Analysis (PCA). Chapter 7 is devoted to attribute DR. Many

techniques can be found for reducing the dimensionality in the number of instances,

like the use of clustering techniques, parametric methods and so on. The reader

will find a complete survey of instance selection (IS) techniques in Chap. 8. The use of binning and

discretization techniques is also useful to reduce the dimensionality and complexity

of the data set. They convert numerical attributes into nominal ones, thus drastically

reducing the cardinality of the attributes involved. Chapter 9 gives a thorough

presentation of these discretization techniques.
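
As a rough illustration of how binning reduces the cardinality of a numerical attribute, the following sketch applies equal-width binning. This is only one simple discretization scheme and is not necessarily among those covered in Chapter 9; the function name `equal_width_bins` and the sample `ages` values are assumptions made for the example.

```python
def equal_width_bins(values, n_bins):
    """Assign each numerical value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    # Clamp the index so that the maximum value falls into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 25, 31, 40, 47, 52, 60, 63, 78]  # hypothetical numerical attribute
labels = equal_width_bins(ages, 3)
# Ten distinct numerical values are mapped to three nominal bin labels.
```

The bin indices can then be treated as the values of a nominal attribute in place of the original numbers.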

References

1. Agrawal, R., Srikant, R.: Searching with numbers. IEEE Trans. Knowl. Data Eng. 15(4), 855-870 (2003)

2. Berry, M.J., Linoff, G.: Data Mining Techniques: For Marketing, Sales, and Customer Support.

Wiley, New York (1997)
