Introduction - Data Preprocessing in Data Mining

1.6.1 Data Preparation

Throughout this book, we refer to data preparation as the set of techniques that initialize the data properly to serve as input for a certain DM algorithm. It is worth mentioning that we prefer the term data preparation to designate one part of data preprocessing; previous texts have used it, somewhat confusingly, for the whole set of processes that perform data preprocessing tasks. That usage is not incorrect and we respect it; however, we prefer to distinguish clearly between data preparation and data reduction, owing to the growing importance that the latter set of techniques has gained in recent years and to some clear distinctions that follow from this separation.

Data preparation is normally a mandatory step. It converts previously unusable data into new data that fits a DM process. If the data is not prepared, the DM algorithm may be unable to accept it as input, or it will report errors at runtime. In the best case the algorithm will run, but its results will not make sense or will not be considered accurate knowledge.
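The failure mode described above can be sketched in a few lines. This is a minimal illustration, not a method from the text: the records, the `prepare` helper and the choice of filling missing values with 0.0 are all hypothetical, and the distance function simply stands in for any DM algorithm that expects numeric input.

```python
import math

def euclidean(a, b):
    """Distance of the kind used by nearest-neighbour style algorithms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical raw records: one value is stored as text, one is missing.
raw = [[1.0, "2.5"], [2.0, None]]

def try_distance(records):
    """Return the distance, or an error message if the input is unusable."""
    try:
        return euclidean(records[0], records[1])
    except TypeError as exc:
        return f"error: {exc}"

def prepare(records, fill=0.0):
    """Toy preparation (an assumption): parse numeric strings, fill missing values."""
    return [[fill if v is None else float(v) for v in row] for row in records]

raw_result = try_distance(raw)                 # the algorithm reports an error
prepared_result = try_distance(prepare(raw))   # the algorithm runs
```

Note that the prepared run succeeds, yet a crude fill value such as 0.0 can still distort the distance, which is the "results will not make sense" case mentioned above.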

Thus, what are the basic issues that must be resolved in data preparation? Here, we provide a list of questions, each paired with the type of process in the data preparation family of techniques that answers it:

How do I clean up the data?—Data Cleaning.

How do I provide accurate data?—Data Transformation.

How do I incorporate and adjust data?—Data Integration.

How do I unify and scale data?—Data Normalization.

How do I handle missing data?—Missing Data Imputation.

How do I detect and manage noise?—Noise Identification.

Next, we briefly describe each of the techniques listed above. Figure 1.3 illustrates the forms of data preparation. They are studied in more depth in the following chapters of this book.
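To make two of the questions above concrete, the sketch below handles a missing value and then scales a column. It assumes mean imputation and min-max scaling to the range [0, 1]; these are common, simple choices used here for illustration only, and the function names and sample column are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no scale information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

column = [10.0, None, 30.0, 20.0]   # hypothetical attribute with one gap
imputed = impute_mean(column)       # missing data imputation
scaled = min_max(imputed)           # data normalization
```

The order matters in this sketch: scaling runs after imputation because `min` and `max` cannot be computed over a column that still contains `None`.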

1.6.1.1 Data Cleaning

Data cleaning, or data cleansing, includes operations that correct bad data, filter some incorrect data out of the data set and reduce unnecessary detail in the data. It is a general concept that comprises or overlaps other well-known data preparation techniques. The treatment of missing and noisy data belongs here, but both categories have been separated out so that the intelligent proposals for them can be analyzed in more depth later in this book. Other data cleaning tasks involve the detection of discrepancies and dirty data (fragments of the original data that do not make sense). These latter tasks are more closely related to understanding the original data, and they generally require human audit.
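A minimal sketch of these operations follows, under stated assumptions: the record layout, the `clean` function and the plausible age range are hypothetical. It corrects formatting, filters out values that cannot be right, and sets aside records that do not make sense so that a person can audit them instead of silently discarding them.

```python
def clean(records, min_age=0, max_age=120):
    """Split records into cleaned rows and rows flagged for human audit."""
    cleaned, flagged = [], []
    for rec in records:
        # Correct bad formatting: stray whitespace and inconsistent case.
        name = rec["name"].strip().title()
        try:
            age = int(rec["age"])
        except (TypeError, ValueError):
            flagged.append(rec)   # dirty data: the value is not a number
            continue
        if not (min_age <= age <= max_age):
            flagged.append(rec)   # discrepancy: outside the assumed range
            continue
        cleaned.append({"name": name, "age": age})
    return cleaned, flagged

# Hypothetical input records.
records = [
    {"name": "  alice ", "age": "34"},
    {"name": "BOB", "age": "-5"},
    {"name": "carol", "age": "unknown"},
]
cleaned, flagged = clean(records)
```

Here only the first record survives cleaning; the other two are returned unchanged in `flagged`, which reflects the point above that resolving discrepancies generally requires human judgment.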
