they represent the complete dataset or the relation of a response variable to the
other variables. Rather, the interestingness of a pattern is determined by how
surprising it is;20 in that respect, a pattern diverging from the global structure
of the dataset is all the more likely to be surprising. The standard example of
pattern mining is association rule mining, in which a single table of 0-1 data is
given and the goal is to detect relations between the columns that hold in
many rows.
Example 4 (Pattern Mining) Consider the sales data of a supermarket; the rows in
the dataset correspond to customers, and the columns to products. A 1 in the column
of a product A for a customer C indicates that customer C bought product A.
An association rule could be that customers who buy diapers also buy beer in
60% of the cases. Such a rule would be surprising if this 60% deviates considerably
from the overall frequency of beer among all customers.
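The diaper-beer rule can be illustrated on a small 0-1 table. The following is a minimal sketch; the purchase data below are made up for illustration and do not come from the text:

```python
# Hypothetical 0-1 purchase table: rows = customers, columns = (diapers, beer).
rows = [
    (1, 1),
    (1, 1),
    (1, 0),
    (0, 1),
    (0, 0),
]

# Support of the rule: fraction of all customers who bought diapers AND beer.
support = sum(d and b for d, b in rows) / len(rows)

# Confidence of the rule: among diaper buyers, the fraction who also bought beer.
diaper_buyers = [(d, b) for d, b in rows if d == 1]
confidence = sum(b for _, b in diaper_buyers) / len(diaper_buyers)

# Baseline: overall frequency of beer among all customers.
overall_beer = sum(b for _, b in rows) / len(rows)

# The rule is surprising if confidence deviates a lot from the baseline.
print(support, confidence, overall_beer)
```

Here the confidence (about 67%) is close to the overall beer frequency (60%), so this particular rule would not be considered very surprising.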
Another example of pattern mining is finding relationships in the database that can
be used to describe the data and/or predict attributes of data subjects. This is
usually done with regression, i.e., finding a function to describe the data. The
simplest form is linear regression, which is used to find the line that best fits
the data. Linear regression is done using the least squares method.21 Non-linear
regression is also possible, but is mostly done when it may reasonably be expected
that the data are better described by non-linear functions. Examples of functions
used in non-linear regression are exponential functions (for instance, for increasing
growth), cyclical functions (for instance, for seasonal influences), and Gaussian
functions (for normal distributions). Combinations of these functions are also
possible, such as a combination of linear growth and seasonal influences.
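A minimal sketch of a linear least-squares fit, using only the closed-form expressions for the slope and intercept; the data points are hypothetical:

```python
# Hypothetical data points, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares solution for y = a*x + b,
# i.e., the (a, b) that minimize sum_i (y_i - a*x_i - b)^2.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
b = mean_y - a * mean_x

print(a, b)  # slope and intercept of the best-fitting line
```

For these points the fitted slope comes out close to 2 and the intercept close to 0, matching the trend the data were constructed with.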
One of the main concerns when using regression is whether the chosen function
is a good description of the data. The quality of such a fit is often expressed by the
so-called correlation coefficient.22 The value r of the correlation coefficient is
always between -1 and +1.23 When r = 1, the line is a perfect fit for the data, i.e.,
all data points are on the line. This is called perfect positive correlation. In the
20 See Subsection 1.1.2.
21 The least squares method involves a minimization procedure of the distances of the data
values from the regression function. In the case of a linear fit $y = \alpha x + \beta$, the values of $\alpha$
and $\beta$ are calculated by minimizing $\sum_{i} (y_i - \alpha x_i - \beta)^2$.
22 The correlation coefficient is independent of the type of regression. In its simplest form,
the correlation coefficient between two parameters x and y is
$$ r = \frac{\displaystyle\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left(\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^{2}\right)\left(\displaystyle\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} y_i\Bigr)^{2}\right)}}. $$
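The formula in footnote 22 can be checked numerically. This is a sketch on made-up data chosen to lie exactly on a line, so r should come out as 1:

```python
import math

# Hypothetical data with a perfect positive linear relation: y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

n = len(xs)
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)

# Correlation coefficient, computed term by term from the footnote formula.
r = (sum_xy - sum_x * sum_y / n) / math.sqrt(
    (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
)

print(r)  # 1.0: perfect positive correlation
```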
23 The value is often expressed as a percentage, but since this is usually done in positive
values, it is impossible to distinguish between positive and negative correlation.