Chapter 3
Data Preparation Basic Models
Abstract The basic preprocessing steps carried out in Data Mining convert real-world data to a computer-readable format. An overview of this topic is given in Sect. 3.1. When there are several or heterogeneous sources of data, an integration of the data needs to be performed. This task is discussed in Sect. 3.2. After the data is computer readable and constitutes a unique source, it usually goes through a cleaning phase where inaccuracies in the data are corrected. Section 3.3 focuses on this task. Finally, some Data Mining applications impose particular constraints, such as ranges for the data features, which may call for the normalization of the features (Sect. 3.4) or the transformation of the data distribution (Sect. 3.5).
3.1 Overview
Data gathered in data sets can take multiple forms and come from many different
sources. Data directly extracted from relational databases or obtained from the real
world is completely raw: it has not been transformed, cleansed or altered in any way.
It may therefore contain errors due to wrong data entry procedures or missing values,
as well as inconsistencies due to ill-handled data merging processes.
Three elements define data quality [15]: accuracy, completeness and consistency.
Unfortunately, real-world data sets often present the opposite conditions, for the
reasons mentioned above. Many preprocessing techniques have been devised to
overcome these problems and to obtain a final, reliable and accurate data set on
which a DM technique can later be applied [35].
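A quick audit of these three dimensions can be automated before any further preprocessing. The following is a minimal sketch in Python, assuming tabular data held in a pandas DataFrame; the column names, value ranges and data are hypothetical.

import pandas as pd
import numpy as np

# Hypothetical raw data: 'age' holds an impossible value (accuracy),
# 'income' is partially missing (completeness), and the same key
# appears twice with conflicting cities (consistency).
raw = pd.DataFrame({
    "id":     [1, 2, 3, 3],
    "age":    [34, -5, 41, 41],
    "income": [52000.0, np.nan, 61000.0, 61000.0],
    "city":   ["Granada", "Jaen", "Madrid", "madrid"],
})

# Completeness: fraction of missing values per attribute.
print(raw.isna().mean())

# Accuracy: flag values outside a plausible domain (assumed range here).
print(raw[(raw["age"] < 0) | (raw["age"] > 120)])

# Consistency: duplicated keys whose remaining attributes disagree.
dupes = raw[raw.duplicated("id", keep=False)]
print(dupes.groupby("id").nunique())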
Gathering all the data elements together is not an easy task when the examples
come from different sources and have to be merged into a single data set. Combining
data from different databases is usually called data integration. Different attribute
names or table schemas will produce uneven examples that need to be consolidated.
Moreover, attribute values may represent the same concept under different names,
creating inconsistencies in the obtained instances. If some attributes are calculated
from others, the data set will be large but the information contained will not scale
accordingly: detecting and eliminating such redundant attributes is needed.
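As a rough illustration of these integration steps, the sketch below matches two hypothetical schemas, joins the records by a shared key, and then drops a derived attribute detected through its near-perfect correlation with the attribute it was computed from; all table and column names are invented for the example.

import pandas as pd

# Two hypothetical sources describing the same customers under
# different attribute names and table schemas.
source_a = pd.DataFrame({"cust_id": [1, 2, 3],
                         "yearly_wage": [52000, 48000, 60000]})
source_b = pd.DataFrame({"customer": [1, 2, 3],
                         "monthly_wage": [4333.33, 4000.00, 5000.00]})

# Schema matching: rename attributes so both tables share one naming.
source_b = source_b.rename(columns={"customer": "cust_id"})

# Entity identification: join records that refer to the same customer.
merged = source_a.merge(source_b, on="cust_id")

# Redundancy detection: 'monthly_wage' is derivable from 'yearly_wage'
# (wage / 12), so the two are almost perfectly correlated and one can
# be dropped without losing information.
if abs(merged["yearly_wage"].corr(merged["monthly_wage"])) > 0.99:
    merged = merged.drop(columns="monthly_wage")

Note that correlation only flags linearly redundant attributes; attributes derived in a non-linear way require other dependency measures.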
 