1.6.1 Data Preparation
Throughout this book, we refer to data preparation as the set of techniques that
initialize the data properly so that it can serve as input for a given DM algorithm.
We prefer the term data preparation to designate one part of data preprocessing;
previous texts have used "data preprocessing", somewhat confusingly, for the whole
set of processes that perform preprocessing tasks. That usage is not incorrect and we
respect it. However, we prefer to distinguish clearly between data preparation and
data reduction, because of the growing importance that the latter set of techniques
has gained in recent years and the clear differences between the two that follow
from this view.
Data preparation is normally a mandatory step. It converts previously unusable data
into new data that fits a DM process. If the data is not prepared, the DM algorithm
might not be able to receive it at all, or it will very likely report errors at runtime.
In the best case, the algorithm will run, but the results it produces will not make
sense or cannot be regarded as accurate knowledge.
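The failure modes described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `mean` helper stands in for any numeric DM routine, and the `raw_ages` column, the `None` marker and the `-999` sentinel are hypothetical values invented for the example.

```python
def mean(values):
    """Plain arithmetic mean; a stand-in for any numeric DM routine."""
    return sum(values) / len(values)


def try_mean(values):
    """Run mean() and report the error type instead of crashing."""
    try:
        return mean(values)
    except TypeError as exc:
        return f"error: {type(exc).__name__}"


# Hypothetical raw column in which one reading is missing (None).
raw_ages = [23, 31, None, 45]

# Case 1: the routine cannot operate on the unprepared data at all.
unprepared_result = try_mean(raw_ages)

# Case 2: the missing reading is encoded as a sentinel (-999, assumed here).
# The routine runs, but the result is not meaningful knowledge.
sentinel_ages = [23, 31, -999, 45]
misleading_result = mean(sentinel_ages)

# Case 3: a minimal preparation step (dropping the missing reading).
prepared_ages = [a for a in raw_ages if a is not None]
prepared_result = mean(prepared_ages)

print(unprepared_result)  # error: TypeError
print(misleading_result)  # -225.0
print(prepared_result)    # 33.0
```

Dropping the missing reading is the simplest possible preparation and is used here only to make the contrast visible; it is not presented as the recommended treatment.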
What, then, are the basic issues that must be resolved in data preparation? Here
we list a set of questions, each paired with the type of process in the data
preparation family of techniques that answers it:
• How do I clean up the data? Data Cleaning.
• How do I provide accurate data? Data Transformation.
• How do I incorporate and adjust data? Data Integration.
• How do I unify and scale data? Data Normalization.
• How do I handle missing data? Missing Data Imputation.
• How do I detect and manage noise? Noise Identification.
Next, we briefly describe each of the techniques listed above. Figure 1.3
shows an explanatory illustration of the forms of data preparation. They
will be studied in more depth in the following chapters of this book.
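As a rough illustration of two of the questions listed above (handling missing data, and unifying and scaling data), the sketch below chains mean imputation with min-max scaling. Both methods are assumptions chosen for brevity, not the approaches this text prescribes, and the `incomes` column and the function names are hypothetical.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column):
    """Rescale values linearly to the [0, 1] interval."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical attribute with one missing value.
incomes = [20000.0, None, 40000.0, 60000.0]

imputed = impute_mean(incomes)   # None is replaced by 40000.0
scaled = min_max_scale(imputed)  # values now lie in [0, 1]

print(imputed)  # [20000.0, 40000.0, 40000.0, 60000.0]
print(scaled)   # [0.0, 0.5, 0.5, 1.0]
```

Imputation runs before scaling here so that the minimum and maximum are computed over a complete column; other orderings and other methods are equally possible.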
1.6.1.1 Data Cleaning
Data cleaning, or data cleansing, includes operations that correct bad data, filter
some incorrect data out of the data set, and reduce unnecessary detail in the data.
It is a general concept that comprises or overlaps with other well-known data
preparation techniques. The treatment of missing and noisy data belongs here, but
both categories are treated separately so that the intelligent proposals for them
can be analyzed in more depth later in this book. Other data cleaning tasks involve
the detection of discrepancies and dirty data (fragments of the original data that
do not make sense). These latter tasks are more closely related to understanding the
original data, and they generally require human audit.
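A minimal sketch of these cleaning operations follows, assuming records stored as dictionaries. The field names (`name`, `age`), the plausible age range and the `clean_records` helper are all hypothetical and serve only to make the idea concrete: formatting problems are corrected automatically, while values that do not make sense are set aside for human audit rather than silently repaired.

```python
def clean_records(records, min_age=0, max_age=120):
    """Split records into cleaned ones and ones needing human audit.

    min_age and max_age are assumed plausibility bounds, not values
    taken from the text.
    """
    cleaned, for_audit = [], []
    for rec in records:
        # Discrepancy check: an impossible value cannot be corrected
        # automatically, so the original record is kept for review.
        if not (min_age <= rec["age"] <= max_age):
            for_audit.append(rec)
            continue
        # Correct bad data: trim whitespace and unify capitalization.
        name = rec["name"].strip().title()
        cleaned.append({"name": name, "age": rec["age"]})
    return cleaned, for_audit


# Hypothetical raw records with formatting problems and one dirty value.
records = [
    {"name": "  alice ", "age": 34},
    {"name": "BOB", "age": -5},
    {"name": "carol", "age": 51},
]

cleaned, for_audit = clean_records(records)

print(cleaned)    # [{'name': 'Alice', 'age': 34}, {'name': 'Carol', 'age': 51}]
print(for_audit)  # [{'name': 'BOB', 'age': -5}]
```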