1.6.1.4 Data Normalization
The measurement units used can affect the data analysis. All the attributes should be
expressed in the same measurement units and should use a common scale or range.
Normalizing the data attempts to give all attributes equal weight, and it is particularly
useful in statistical learning methods.
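As a minimal sketch, the two most common normalization schemes, min-max scaling and z-score standardization, can be expressed as follows (the attribute values and target range are purely illustrative):

import numpy as np

def min_max_normalize(x, new_min=0.0, new_max=1.0):
    # Linearly rescale the attribute into [new_min, new_max].
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score_normalize(x):
    # Center the attribute at 0 with unit standard deviation.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Incomes (dollars) and ages (years) live on very different scales;
# without normalization the income attribute would dominate any
# distance-based method.
income = [12000, 35000, 98000, 54000]
age = [23, 45, 31, 62]
print(min_max_normalize(income))  # values rescaled into [0, 1]
print(z_score_normalize(age))     # values with mean 0 and std 1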
1.6.1.5 Missing Data Imputation [23]
This is a form of data cleaning whose purpose is to fill the variables that contain
missing values (MVs) with some intuitive data. In most cases, adding a reasonable
estimate of a suitable data value is better than leaving it blank.
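A minimal sketch of one intuitive strategy, mean imputation for a numeric attribute, assuming MVs are encoded as NaN:

import numpy as np

def mean_impute(x):
    # Replace each NaN with the mean of the observed (non-missing) values.
    x = np.asarray(x, dtype=float)
    filled = x.copy()
    filled[np.isnan(filled)] = np.nanmean(x)
    return filled

values = [4.2, np.nan, 5.1, np.nan, 3.9]
print(mean_impute(values))  # NaNs become 4.4, the mean of 4.2, 5.1, 3.9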
1.6.1.6 Noise Identification [29]
Included as a step of data cleaning, and also known as smoothing in data transfor-
mation, its main objective is to detect random errors or variance in a measured
variable. Note that we refer to the detection of noise rather than its removal, which
is more closely related to the IS task within data reduction. Once a noisy example is
detected, we can apply a correction-based process that may involve some kind of
underlying operation, as illustrated below.
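One deliberately simple detection rule, offered only as an illustration, flags values lying more than k standard deviations from the attribute mean; the threshold k = 2 is an arbitrary assumption:

import numpy as np

def flag_noisy(x, k=2.0):
    # Mark values more than k standard deviations from the mean
    # as potentially noisy; they are flagged, not removed.
    x = np.asarray(x, dtype=float)
    z = np.abs(x - x.mean()) / x.std()
    return z > k

readings = [10.1, 9.8, 10.3, 55.0, 10.0, 9.9]
print(flag_noisy(readings))  # only the 55.0 reading is flagged

Once flagged, such an example could then be corrected, for instance smoothed towards a local estimate, rather than discarded, in line with the correction-based process mentioned above.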
1.6.2 Data Reduction
Data reduction comprises the set of techniques that, in one way or another, obtain a
reduced representation of the original data. For us, data preparation techniques are
distinguished as those needed to make the data a suitable input for a DM task. As
we have mentioned before, this means that if data preparation is not properly
conducted, the DM algorithms either will not run or will surely report wrong results.
In the case of data reduction, the data produced usually maintains the essential
structure and integrity of the original data, but its volume is downsized. So, can the
original data be used as the input of a DM process without applying a data reduction
step? The answer is yes, but then other major issues must be taken into account,
issues just as crucial as those addressed by data preparation.
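A minimal sketch of one of the simplest reduction techniques, numerosity reduction by uniform random sampling (the sampling fraction and data shape are arbitrary choices for illustration):

import numpy as np

def random_sample(data, fraction=0.05, seed=42):
    # Keep a uniform random subset of the rows, downsizing the data
    # while aiming to preserve its essential structure.
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(data) * fraction))
    idx = rng.choice(len(data), size=n_keep, replace=False)
    return data[idx]

data = np.random.default_rng(0).normal(size=(100000, 5))
reduced = random_sample(data)
print(data.shape, "->", reduced.shape)  # (100000, 5) -> (5000, 5)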
Hence, at a glance, data reduction can be considered an optional step. However,
this claim may be contested. Although the integrity of the data is maintained, it is
well known that any algorithm has a certain time complexity that depends on several
parameters. In DM, one of these parameters is directly proportional to the size of
the input database. If that size exceeds a certain limit, a limit that is highly dependent
on the type of DM algorithm, running the algorithm can become prohibitive, and
then the data reduction task is just as crucial as data preparation. Regarding other
 