Having a uniform data set without measurable inconsistencies does not mean
that the data is clean. Errors such as MVs or uncontrolled noise may still be present. A
data cleaning step is usually needed to filter or correct wrong data. Otherwise, the
knowledge extracted by a DM algorithm will be barely accurate, or the DM algorithm
will not be able to handle the data at all.
Ending up with a consistent and almost error-free data set does not mean that the
data is in the best form for a DM algorithm. Some DM algorithms, such as ANNs or
distance-based methods, work much better with normalized attribute values.
Others are not able to work with nominal-valued attributes, or benefit from subtle
transformations of the data. Data normalization and data transformation techniques
have been devised to adapt a data set to the needs of the DM algorithm that will be
applied afterwards. Note that even after eliminating redundant attributes and
inconsistencies, the data set may still be large enough to slow down the DM process.
Data reduction techniques are quite useful in this situation, as they can reduce
the number of attributes or instances while preserving the original information as
much as possible.
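
As a simple illustration of the normalization step mentioned above, the following sketch rescales every numeric attribute to the [0, 1] interval (min-max normalization). The function name and example values are not taken from the text; they are assumptions used only for illustration.

import numpy as np

def min_max_normalize(X):
    """Rescale every attribute (column) of X to the [0, 1] interval."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Guard against constant columns to avoid division by zero
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span

# Two attributes on very different scales become directly comparable
data = np.array([[1.0, 200.0],
                 [2.0, 800.0],
                 [3.0, 500.0]])
print(min_max_normalize(data))

After this rescaling, distance-based methods no longer let the attribute with the largest raw range dominate the distance computation.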
To sum up, real-world data is usually incomplete, dirty and inconsistent. Therefore,
data preprocessing techniques are needed to improve the accuracy and efficiency of the
subsequent DM technique. The rest of the chapter describes the basic
techniques used to prepare the data set, while pointing the reader
to the chapters where advanced techniques are presented in greater depth.
3.2 Data Integration
One hard problem in DM is to collect a single data set with information coming from
varied and different sources. If the integration process is not properly performed,
redundancies and inconsistencies will soon appear, resulting in a decrease in the
accuracy and speed of the subsequent DM processes. Matching the schemas of dif-
ferent sources is a notorious problem that usually does not come alone:
inconsistent and duplicated tuples, as well as redundant and correlated attributes, are
problems that the data set creation process will probably reveal sooner or later.
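
As a rough illustration of how redundant, correlated attributes can be spotted during integration, the sketch below flags attribute pairs whose absolute Pearson correlation reaches a threshold. The function, attribute names and threshold value are illustrative assumptions, not part of the original text.

import numpy as np

def highly_correlated_pairs(X, names, threshold=0.9):
    """List attribute pairs whose absolute Pearson correlation reaches the threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# 'height_cm' and 'height_in' are near duplicates and should be flagged
X = [[170, 66.9, 60], [180, 70.9, 58], [165, 65.0, 75], [190, 74.8, 62]]
print(highly_correlated_pairs(X, ["height_cm", "height_in", "weight_kg"]))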
An essential part of the integration process is to build a data map that establishes
how each instance should be arranged in a common structure to represent a real-
world example. When data is obtained from relational databases, it is usually
flattened, that is, gathered together into one single record. Some database
frameworks enable the user to provide a map to directly traverse the database
through in-database access utilities. While using these in-database mining tools has
the advantage of not having to extract and create an external file for the data, it is not
the best option for treatment with preprocessing techniques. In this case extracting
the data is usually the best option. Preprocessing takes time, and when the data is
kept in the database, preprocessing has to be applied repeatedly. By contrast, if the
data is extracted to an external file, processing and modeling can be faster, provided
that the already preprocessed data completely fits in memory.
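
The flattening step described above can be sketched as follows, assuming two hypothetical relational tables (customers and orders) that are joined and aggregated into one record per customer. The table and column names are invented for illustration and do not come from the text.

import pandas as pd

# Hypothetical relational tables used only for illustration
customers = pd.DataFrame({"customer_id": [1, 2],
                          "city": ["Granada", "Madrid"]})
orders = pd.DataFrame({"order_id": [10, 11, 12],
                       "customer_id": [1, 1, 2],
                       "amount": [25.0, 40.0, 15.0]})

# Flatten: join both tables and aggregate the orders so that each
# customer is described by one single record
flat = (orders.merge(customers, on="customer_id")
              .groupby(["customer_id", "city"])["amount"]
              .agg(total_amount="sum", n_orders="count")
              .reset_index())
print(flat)

Once the data is in this flat, one-record-per-example form, the preprocessing techniques discussed in the rest of the chapter can be applied directly to the extracted file.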