1.6.1.4 Data Normalization
The measurement units used can affect the data analysis. All the attributes should be
expressed in the same measurement units and should use a common scale or range.
Normalizing the data attempts to give all attributes equal weight, and it is particularly
useful in statistical learning methods.
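As a minimal sketch, the two most common normalization schemes, min-max scaling and z-score standardization, can be expressed as follows (the attribute values and target range are purely illustrative):

import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    # Linearly rescale the attribute into [new_min, new_max].
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score_normalize(x):
    # Center the attribute at 0 with unit standard deviation.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Incomes (dollars) and ages (years) live on very different scales;
# without normalization the income attribute would dominate any
# distance-based method.
income = [12000, 35000, 98000, 54000]
age = [23, 45, 31, 62]
print(min_max_normalize(income))  # values rescaled into [0, 1]
print(z_score_normalize(age))     # values with mean 0 and std 1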
1.6.1.5 Missing Data Imputation [23]
This is a form of data cleaning whose purpose is to fill the variables that contain
missing values (MVs) with some intuitive data. In most cases, adding a reasonable
estimate of a suitable data value is better than leaving it blank.
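A minimal sketch of one intuitive strategy, mean imputation for a numeric attribute, assuming MVs are encoded as NaN:

import numpy as np

def mean_impute(x):
    # Replace each NaN with the mean of the observed (non-missing) values.
    x = np.asarray(x, dtype=float)
    filled = x.copy()
    filled[np.isnan(filled)] = np.nanmean(x)
    return filled

values = [4.2, np.nan, 5.1, np.nan, 3.9]
print(mean_impute(values))  # NaNs become 4.4, the mean of 4.2, 5.1, 3.9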
1.6.1.6 Noise Identification [29]
Included as a step of data cleaning, and also known as smoothing in data transfor-
mation, its main objective is to detect random errors or variance in a measured
variable. Note that we refer to the detection of noise rather than its removal, which
is more closely related to the IS task within data reduction. Once a noisy example is
detected, we can apply a correction-based process that may involve some kind of
underlying operation, as illustrated below.
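One deliberately simple detection rule, offered only as an illustration, flags values lying more than k standard deviations from the attribute mean; the threshold k = 2 is an arbitrary assumption:

import numpy as np

def flag_noisy(x, k=2.0):
    # Mark values more than k standard deviations from the mean
    # as potentially noisy; they are flagged, not removed.
    x = np.asarray(x, dtype=float)
    z = np.abs(x - x.mean()) / x.std()
    return z > k

readings = [10.1, 9.8, 10.3, 55.0, 10.0, 9.9]
print(flag_noisy(readings))  # only the 55.0 reading is flagged

Once flagged, such an example could then be corrected, for instance smoothed towards a local estimate, rather than discarded, in line with the correction-based process mentioned above.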
1.6.2 Data Reduction
Data reduction comprises the set of techniques that, in one way or another, obtain a
reduced representation of the original data. For us, data preparation techniques are
distinguished as those needed to make the data a suitable input for a DM task. As
we have mentioned before, this means that if data preparation is not properly
conducted, the DM algorithms either will not run or will surely report wrong results.
In the case of data reduction, the data produced usually maintains the essential
structure and integrity of the original data, but its volume is downsized. So, can the
original data be used as the input of a DM process without applying a data reduction
step? The answer is yes, but then other major issues must be taken into account,
issues just as crucial as those addressed by data preparation.
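A minimal sketch of one of the simplest reduction techniques, numerosity reduction by uniform random sampling (the sampling fraction and data shape are arbitrary choices for illustration):

import numpy as np

def random_sample(data, fraction=0.05, seed=42):
    # Keep a uniform random subset of the rows, downsizing the data
    # while aiming to preserve its essential structure.
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(data) * fraction))
    idx = rng.choice(len(data), size=n_keep, replace=False)
    return data[idx]

data = np.random.default_rng(0).normal(size=(100000, 5))
reduced = random_sample(data)
print(data.shape, "->", reduced.shape)  # (100000, 5) -> (5000, 5)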
Hence, at a glance, data reduction can be considered an optional step. However,
this claim may be contested. Although the integrity of the data is maintained, it is
well known that any algorithm has a certain time complexity that depends on several
parameters. In DM, one of these parameters is directly proportional to the size of
the input database. If that size exceeds a certain limit, a limit that is highly dependent
on the type of DM algorithm, running the algorithm can become prohibitive, and
then the data reduction task is just as crucial as data preparation. Regarding other
 