case of r = -1, there is perfect negative correlation. Negative correlation exists
when one parameter increases while the other decreases, and vice versa. In the
case of positive correlation, parameters decrease or increase simultaneously. In the
case of r = 0, there is no correlation at all. No line can be found that gives a good
description of the data; any line is as good or bad as any other. In practice, a
correlation is seldom perfect, i.e., exactly r = 1 or r = -1. Depending on the context,
correlations of roughly 0.75 to 0.95 are considered high. [24] When the correlation
coefficient lies between roughly -0.5 and +0.5, it is assumed that there is no correlation.
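As a rough illustration of the thresholds above, the following sketch computes a correlation coefficient and labels it. It assumes that r denotes the Pearson (linear) correlation coefficient, which the text here does not state explicitly. The function names pearson_r and describe are hypothetical, and the cut-off values are the approximate, context-dependent ones mentioned above, not fixed rules.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences.

    Assumption: r in the text is the Pearson coefficient.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def describe(r):
    """Label r with the rough thresholds from the text (hypothetical helper)."""
    if abs(r) >= 0.75:
        return "high positive" if r > 0 else "high negative"
    if abs(r) <= 0.5:
        return "none assumed"
    # The text does not classify values between 0.5 and 0.75 in magnitude.
    return "in between"


if __name__ == "__main__":
    r = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
    print(r, describe(r))
```

In the usage example one parameter increases while the other decreases along a straight line, so the result is a perfect negative correlation.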
2.5 Supporting Techniques
The previous section discussed data mining techniques that aim directly at
discovering patterns and relations. This section discusses some additional
techniques that are not directly aimed at discovering patterns and relations, but
that may nevertheless significantly enhance the results of the data mining
techniques discussed in the previous section. We will distinguish pre-processing
techniques and database coupling techniques.
2.5.1 Pre-processing Techniques
An important first step when analyzing data is to make sure that the input data is
suitable for mining. Here we will briefly explain some common pre-processing
techniques:
Discretization: Some data mining methods are developed to work with
nominal attributes only, i.e., attributes that are non-numerical and do not
have any natural order. An example of such an attribute is the brand
of a car. If the dataset contains numerical attributes, we cannot directly
apply such a method, because it assumes that
the attributes are nominal and contain only a limited number of distinct
values. Discretization is the process of dividing the values of a
numerical attribute into a limited number of non-overlapping ranges. For
example, an attribute age could be divided into the ranges 0-10, 11-20, 21-30,
and so on. Each exact numerical value is then replaced by the range it
falls in, which reduces the number of distinct values and makes the attribute
usable for the nominal data mining method. In this process, some of the
precision of the data is necessarily lost, but at the same time the dataset
becomes suitable for more methods, and if the ranges are carefully chosen,
the final results may even be more interpretable for a human user.
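The discretization of the age attribute described above can be sketched as follows. This is a minimal example under stated assumptions: ages are non-negative integers, and all ranges have the same width, following the 0-10, 11-20, 21-30 pattern from the example. The function name discretize_age and the width parameter are hypothetical. In practice the ranges would be chosen to fit the data.

```python
def discretize_age(age, width=10):
    """Replace an integer age by the label of the range it falls in.

    Ranges follow the pattern 0-10, 11-20, 21-30, ... (assumed equal width).
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= width:
        # The first range also contains 0, so it is one value wider.
        return f"0-{width}"
    lower = ((age - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


if __name__ == "__main__":
    ages = [3, 10, 11, 20, 21, 47]
    # Each exact value is replaced by a nominal range label.
    print([discretize_age(a) for a in ages])
```

After this step the attribute has only a handful of distinct labels, which a method for nominal attributes can handle. The exact ages can no longer be recovered from the labels.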
Missing value imputation: In many datasets, values are sporadically
missing from the records. For example, for some people in a dataset we
might not know their age, and record “null” (database terminology for
unknown) instead of a value. In the discretization example we already saw
[24] This also holds for negative correlation: between -0.95 and -0.75 the correlation is
(depending on the context) considered high.