Having a uniform data set without measurable inconsistencies does not mean
that the data is clean. Errors such as MVs or uncontrolled noise may still be present. A
data cleaning step is usually needed to filter or correct wrong data. Otherwise, the
knowledge extracted by a DM algorithm will be barely accurate, or the DM algorithm
will not be able to handle the data at all.
Ending up with a consistent and almost error-free data set does not mean that the
data is in the best form for a DM algorithm. Some DM algorithms, such as ANNs or
distance-based methods, work much better with normalized attribute values.
Others are not able to work with nominal-valued attributes, or benefit from subtle
transformations of the data. Data normalization and data transformation techniques
have been devised to adapt a data set to the needs of the DM algorithm that will be
applied afterwards. Note that even after eliminating redundant attributes and
inconsistencies, the data set may still be large enough to slow down the DM process.
Data reduction techniques are quite useful in this situation, as they can reduce
the number of attributes or instances while preserving the original information as
much as possible.
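
As a simple illustration of the normalization step mentioned above, the following sketch rescales every numeric attribute to the [0, 1] interval (min-max normalization). The function name and example values are not taken from the text; they are assumptions used only for illustration.

import numpy as np

def min_max_normalize(X):
    """Rescale every attribute (column) of X to the [0, 1] interval."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Guard against constant columns to avoid division by zero
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span

# Two attributes on very different scales become directly comparable
data = np.array([[1.0, 200.0],
                 [2.0, 800.0],
                 [3.0, 500.0]])
print(min_max_normalize(data))

After this rescaling, distance-based methods no longer let the attribute with the largest raw range dominate the distance computation.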
To sum up, real-world data is usually incomplete, dirty and inconsistent. Therefore,
data preprocessing techniques are needed to improve the accuracy and efficiency of the
subsequent DM technique. The rest of the chapter describes the basic
techniques used to prepare the data set, while pointing the reader
to the chapters where advanced techniques are presented in greater depth.
3.2 Data Integration
One hard problem in DM is to collect a single data set with information coming from
varied and different sources. If the integration process is not properly performed,
redundancies and inconsistencies will soon appear, resulting in a decrease in the
accuracy and speed of the subsequent DM processes. Matching the schemas of dif-
ferent sources is a notorious problem that usually does not come alone:
inconsistent and duplicated tuples, as well as redundant and correlated attributes, are
problems that the data set creation process will probably reveal sooner or later.
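
As a rough illustration of how redundant, correlated attributes can be spotted during integration, the sketch below flags attribute pairs whose absolute Pearson correlation reaches a threshold. The function, attribute names and threshold value are illustrative assumptions, not part of the original text.

import numpy as np

def highly_correlated_pairs(X, names, threshold=0.9):
    """List attribute pairs whose absolute Pearson correlation reaches the threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# 'height_cm' and 'height_in' are near duplicates and should be flagged
X = [[170, 66.9, 60], [180, 70.9, 58], [165, 65.0, 75], [190, 74.8, 62]]
print(highly_correlated_pairs(X, ["height_cm", "height_in", "weight_kg"]))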
An essential part of the integration process is to build a data map that establishes
how each instance should be arranged in a common structure to represent a real-
world example. When data is obtained from relational databases, it is usually
flattened, that is, gathered together into one single record. Some database
frameworks enable the user to provide a map to directly traverse the database
through in-database access utilities. While using these in-database mining tools has
the advantage of not having to extract and create an external file for the data, it is not
the best option for treatment with preprocessing techniques. In this case extracting
the data is usually the best option. Preprocessing takes time, and when the data is
kept in the database, preprocessing has to be applied repeatedly. By contrast, if the
data is extracted to an external file, processing and modeling can be faster, provided
that the already preprocessed data completely fits in memory.
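
The flattening step described above can be sketched as follows, assuming two hypothetical relational tables (customers and orders) that are joined and aggregated into one record per customer. The table and column names are invented for illustration and do not come from the text.

import pandas as pd

# Hypothetical relational tables used only for illustration
customers = pd.DataFrame({"customer_id": [1, 2],
                          "city": ["Granada", "Madrid"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 2],
                       "amount": [25.0, 40.0, 15.0]})

# Flatten: join both tables and aggregate the orders so that each
# customer is described by one single record
flat = (orders.merge(customers, on="customer_id")
              .groupby(["customer_id", "city"])["amount"]
              .agg(total_amount="sum", n_orders="count")
              .reset_index())
print(flat)

Once the data is in this flat, one-record-per-example form, the preprocessing techniques discussed in the rest of the chapter can be applied directly to the extracted file.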