Chapter 3
Data Preparation Basic Models
Abstract The basic preprocessing steps carried out in Data Mining convert real-world data to a computer-readable format. An overview of this topic is given in Sect. 3.1. When there are several or heterogeneous sources of data, an integration of the data needs to be performed. This task is discussed in Sect. 3.2. After the data is computer readable and constitutes a unique source, it usually goes through a cleaning phase where inaccuracies in the data are corrected. Section 3.3 focuses on this task. Finally, some Data Mining applications impose particular constraints, such as ranges for the data features, which may call for the normalization of the features (Sect. 3.4) or the transformation of the data distribution (Sect. 3.5).
3.1 Overview
Data gathered in data sets can take multiple forms and come from many different
sources. Data directly extracted from relational databases or obtained from the real
world is completely raw: it has not been transformed, cleansed or altered in any way.
It may therefore contain errors due to wrong data entry procedures or missing values,
as well as inconsistencies due to ill-handled data merging processes.
Three elements define data quality [15]: accuracy, completeness and consistency.
Unfortunately, real-world data sets often present the opposite conditions, for the
reasons mentioned above. Many preprocessing techniques have been devised to
overcome these problems and to obtain a final, reliable and accurate data set on
which a DM technique can later be applied [35].
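A quick audit of these three dimensions can be automated before any further preprocessing. The following is a minimal sketch in Python, assuming tabular data held in a pandas DataFrame; the column names, value ranges and data are hypothetical.

import pandas as pd
import numpy as np

# Hypothetical raw data: 'age' holds an impossible value (accuracy),
# 'income' is partially missing (completeness), and the same key
# appears twice with conflicting cities (consistency).
raw = pd.DataFrame({
    "id":     [1, 2, 3, 3],
    "age":    [34, -5, 41, 41],
    "income": [52000.0, np.nan, 61000.0, 61000.0],
    "city":   ["Granada", "Jaen", "Madrid", "madrid"],
})

# Completeness: fraction of missing values per attribute.
print(raw.isna().mean())

# Accuracy: flag values outside a plausible domain (assumed range here).
print(raw[(raw["age"] < 0) | (raw["age"] > 120)])

# Consistency: duplicated keys whose remaining attributes disagree.
dupes = raw[raw.duplicated("id", keep=False)]
print(dupes.groupby("id").nunique())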
Gathering all the data elements together is not an easy task when the examples
come from different sources and have to be merged into a single data set. Combining
data from different databases is usually called data integration. Different attribute
names or table schemas will produce uneven examples that need to be consolidated.
Moreover, attribute values may represent the same concept under different names,
creating inconsistencies in the obtained instances. If some attributes are calculated
from others, the data set will be large but the information contained will not scale
accordingly: detecting and eliminating such redundant attributes is needed.
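As a rough illustration of these integration steps, the sketch below matches two hypothetical schemas, joins the records by a shared key, and then drops a derived attribute detected through its near-perfect correlation with the attribute it was computed from; all table and column names are invented for the example.

import pandas as pd

# Two hypothetical sources describing the same customers under
# different attribute names and table schemas.
source_a = pd.DataFrame({"cust_id": [1, 2, 3],
                         "yearly_wage": [52000, 48000, 60000]})
source_b = pd.DataFrame({"customer": [1, 2, 3],
                         "monthly_wage": [4333.33, 4000.00, 5000.00]})

# Schema matching: rename attributes so both tables share one naming.
source_b = source_b.rename(columns={"customer": "cust_id"})

# Entity identification: join records that refer to the same customer.
merged = source_a.merge(source_b, on="cust_id")

# Redundancy detection: 'monthly_wage' is derivable from 'yearly_wage'
# (wage / 12), so the two are almost perfectly correlated and one can
# be dropped without losing information.
if abs(merged["yearly_wage"].corr(merged["monthly_wage"])) > 0.99:
    merged = merged.drop(columns="monthly_wage")

Note that correlation only flags linearly redundant attributes; attributes derived in a non-linear way require other dependency measures.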
 