What Is Data Mining and How Does It Work? - Discrimination and Privacy in the Information Society

case of r = -1, there is perfect negative correlation. Negative correlation exists

when one parameter increases while the other decreases, and vice versa. In the

case of positive correlation, parameters decrease or increase simultaneously. In the

case of r = 0, there is no correlation at all. No line can be found that gives a good

description of the data; any line is as good or bad as any other. In practice, a

correlation is seldom perfect, i.e., r = 1 or r = -1. Depending on the context, correlations of

roughly 0.75 to 0.95 are considered high.[24] When the correlation coefficient lies

between roughly -0.5 and +0.5, it is assumed that there is no correlation.
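The interpretation of r described above can be illustrated with a minimal Python sketch. It assumes that r is the Pearson correlation coefficient; the function names `pearson_r` and `label_strength`, the sample data, and the label "in between" for values the text does not classify are illustrative assumptions, not part of the original text. Treating every |r| of at least 0.75 as "high" (including values above 0.95) is also a simplification made here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one parameter is constant")
    return cov / sqrt(var_x * var_y)


def label_strength(r):
    """Label r with the rough, context-dependent thresholds from the text."""
    if abs(r) >= 0.75:
        return "high"
    if abs(r) <= 0.5:
        return "no correlation assumed"
    return "in between"  # assumption: the text does not name this range


# Hypothetical sample data: ys_up rises with xs, ys_down falls as xs rises.
xs = [1, 2, 3, 4]
ys_up = [2, 4, 6, 8]
ys_down = [8, 6, 4, 2]
print(pearson_r(xs, ys_up), label_strength(pearson_r(xs, ys_up)))
print(pearson_r(xs, ys_down), label_strength(pearson_r(xs, ys_down)))
```

The sign of r gives the direction of the relation, while its absolute value is what the thresholds are applied to, which is why negative values are labeled in the same way as positive ones.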

2.5 Supporting Techniques

The previous section discussed data mining techniques that aim directly at

discovering patterns and relations. This section discusses some additional

techniques that are not directly aimed at discovering patterns and relations, but

that may nevertheless significantly enhance the results of the data mining

techniques discussed in the previous section. We will distinguish pre-processing

techniques and database coupling techniques.

2.5.1 Pre-processing Techniques

An important first step when analyzing data is to make sure that the input data is

suitable for mining. Here we will briefly explain some common pre-processing

techniques:

• Discretization: Some data mining methods are developed to work with

nominal attributes only; i.e., attributes that are non-numerical and do not

have any natural order. An example of such an attribute could be the brand

of a car. If the dataset does contain numerical attributes, we cannot directly

apply such a method, because the method will assume that

the attributes are nominal and contain only a limited number of distinct

values. Discretization is the process of dividing up the values of a

numerical attribute into a limited number of non-overlapping ranges. For

example, an attribute age could be divided into the ranges 0-10, 11-20,

21-30, and so on. Each exact numerical value is then replaced by the range it

falls in, effectively reducing the number of distinct values and making the

attribute usable for the nominal data mining method. In this process,

some of the accuracy of the data is necessarily lost, but at the same time the dataset

becomes suitable for more methods, and if the ranges are carefully chosen,

the final results may even be more interpretable for a human user.

• Missing value imputation: In many datasets there are sporadic values

missing in the records. For example, for some people in a dataset we

might not know their age, and record “null” (database speech for “unknown”)

instead of a value. In the discretization example we already saw

[24] This goes for negative correlation as well: between -0.95 and -0.75 the correlation is

(depending on the context) considered high.
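The discretization step described earlier in this section, with the age ranges 0-10, 11-20, 21-30, and so on, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: ages are non-negative whole numbers, all ranges after the first span ten values, and the function name `discretize_age` and the sample ages are hypothetical.

```python
def discretize_age(age, width=10):
    """Map a non-negative whole-number age to a range label such as '11-20'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= width:
        # The first range in the text, 0-10, also contains the value 0.
        return f"0-{width}"
    lower = ((age - 1) // width) * width + 1
    return f"{lower}-{lower + width - 1}"


# Hypothetical sample data: six exact ages collapse into four nominal values.
ages = [3, 10, 11, 20, 21, 35]
labels = [discretize_age(a) for a in ages]
print(labels)
print(len(set(ages)), "distinct values before,", len(set(labels)), "after")
```

The resulting labels can be treated as a nominal attribute. The exact age can no longer be recovered from a label, which is the loss of accuracy mentioned in the text.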
