they represent the complete dataset or the relation of a response variable to the
other variables. Rather, the interestingness of a pattern is determined by how
surprising it is; 20 in that respect, a pattern that diverges from the global
structure of the dataset is all the more likely to be surprising. The standard
example of pattern mining is association rule mining. In association rule mining,
a single table of 0-1 data is given, and the goal is to detect relations between
the columns that hold in many rows.
Example 4 (Pattern Mining) Consider the sales data of a supermarket; the rows in
the dataset correspond to customers, and the columns to products. A 1 in the column
of a product A for a customer C indicates that customer C bought product A.
An association rule could be that customers who buy diapers also buy beer in
60% of the cases. Such a rule would be surprising if this 60% deviates a lot from
the overall frequency of beer among all customers.
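The 60% in Example 4 is the confidence of the rule. As a rough illustration, the following Python sketch computes it on a small made-up 0-1 table (the column names and values are purely hypothetical):

```python
import pandas as pd

# A tiny hypothetical 0-1 transaction table for the diapers -> beer rule
# of Example 4: rows are customers, columns are products.
data = pd.DataFrame(
    {
        "diapers": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        "beer":    [1, 1, 1, 0, 0, 1, 0, 0, 0, 0],
    }
)

# Confidence of "diapers -> beer": among customers who bought diapers,
# the fraction that also bought beer.
bought_diapers = data[data["diapers"] == 1]
confidence = bought_diapers["beer"].mean()

# Overall frequency of beer among all customers; the rule is surprising
# only if the confidence deviates a lot from this baseline.
baseline = data["beer"].mean()

print(f"confidence(diapers -> beer) = {confidence:.0%}")  # 60%
print(f"overall beer frequency      = {baseline:.0%}")    # 40%
```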
Another example of pattern mining is finding relationships in the database that can
be used to describe the data and/or predict attributes of data subjects. This is
usually done with regression, i.e., finding a function to describe the data. The
simplest regression is linear regression, which is used to find the line that best
fits the data. Linear regression is done using the least squares method. 21
Non-linear regression is also possible, but is mostly done when it may reasonably
be expected that the data are better described using non-linear functions. Examples
of functions used in non-linear regression are exponential functions (for instance,
for increasing growth), cyclical functions (for instance, for seasonal influences),
and Gaussian functions (for normal distributions). Combinations of these functions
are also possible, such as a combination of linear growth and seasonal influences.
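To make this concrete, the following Python sketch fits a line using the closed-form solution of the least squares minimization in footnote 21, and then fits a combination of linear growth and a seasonal component; the data are synthetic and the 12-month seasonal period is an assumption for illustration. It shows that non-linear regression only replaces the model function:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic data: a linear trend with noise (all values are illustrative).
rng = np.random.default_rng(0)
x = np.arange(24, dtype=float)               # e.g. 24 months
y = 2.5 * x + 10 + rng.normal(0, 3, x.size)

# Linear regression: closed-form solution of the least squares
# minimization of footnote 21, sum_i (y_i - alpha*x_i - beta)^2.
n = x.size
alpha = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum() ** 2)
beta = (y.sum() - alpha * x.sum()) / n
print(f"linear fit: y = {alpha:.2f}x + {beta:.2f}")

# Non-linear regression only changes the model function; here a
# combination of linear growth and a seasonal (cyclical) component,
# assuming a 12-month period.
def trend_plus_season(t, a, b, c):
    return a * t + b + c * np.sin(2 * np.pi * t / 12)

params, _ = curve_fit(trend_plus_season, x, y)
print("trend + season parameters (a, b, c):", params)
```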
One of the main concerns when using regression is whether the chosen function
is a good description of the data. The quality of such a fit is often expressed by
the so-called correlation coefficient. 22 The value r of the correlation coefficient
is always between -1 and +1. 23 When r = 1, the line is a perfect fit for the data,
i.e., all data points are on the line. This is called perfect positive correlation.
In the case of r = -1, the fit is equally perfect, but the line has a negative
slope; this is called perfect negative correlation.
20 See Subsection 1.1.2.
21 The least squares method involves a minimization procedure of distances of the data
values from the regression function. In the case of a linear fit $y = \alpha x + \beta$,
the values of $\alpha$ and $\beta$ are calculated by minimizing
$\sum_{i=1}^{n} (y_i - \alpha x_i - \beta)^2$.
22 The correlation coefficient is independent of the type of regression. In its simplest form,
the correlation coefficient between two parameters x and y is
$$
r = \frac{\sum_{i=1}^{n} x_i y_i \;-\; \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
{\sqrt{\left( \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^2 \right)
\left( \sum_{i=1}^{n} y_i^2 - \frac{1}{n} \Bigl( \sum_{i=1}^{n} y_i \Bigr)^2 \right)}}.
$$
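As a quick check of this formula, the following minimal Python sketch implements it directly and compares the result with NumPy's built-in Pearson correlation; the sample data are made up:

```python
import numpy as np

def correlation(x, y):
    # Direct implementation of the formula in footnote 22.
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size
    num = (x * y).sum() - x.sum() * y.sum() / n
    den = np.sqrt(((x**2).sum() - x.sum() ** 2 / n) *
                  ((y**2).sum() - y.sum() ** 2 / n))
    return num / den

# Made-up sample: y grows roughly linearly with x, so r should be near +1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = correlation(x, y)
print(f"r = {r:.4f}")                          # close to +1
assert np.isclose(r, np.corrcoef(x, y)[0, 1])  # agrees with NumPy's Pearson r
```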
23 The value is often expressed as a percentage, but since percentages are usually given as
positive values, it then becomes impossible to distinguish between positive and negative
correlation.