on correlations and do not necessarily imply causation. To further illustrate this
point let us look at the following example:
Let $P(X)$ be the probability that a transaction $T$ from database $D$ contains the itemset $X$. Let $P(X, Y)$ be the probability that both $X$ and $Y$ are contained in $T \in D$. Now let $X$ and $Y$ be stochastically independent:
$$P(X) \cdot P(Y) = P(X, Y).$$
Then for the confidence of the rule $X \to Y$ it follows that
$$\mathrm{conf}(X \to Y) = P(Y).$$
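The intermediate step can be spelled out as follows. This is a sketch that assumes the usual definition of confidence as a conditional probability, $\mathrm{conf}(X \to Y) = P(X, Y) / P(X)$ with $P(X) > 0$, which is not restated in this passage:

```latex
\[
\mathrm{conf}(X \to Y)
  = \frac{P(X, Y)}{P(X)}
  = \frac{P(X) \cdot P(Y)}{P(X)}
  = P(Y), \qquad P(X) > 0 .
\]
```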
This simple observation shows a severe shortcoming of the support-confidence framework. As soon as the itemset $Y$ occurs comparatively often in the data, the rule $X \to Y$ also has a high confidence value. This suggests a dependency of $Y$ on $X$, although in fact both itemsets are stochastically independent. To cope with this problem, additional rule quality measures have been developed.
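The shortcoming can be reproduced on a small example. The following is a minimal sketch, not taken from the text: the toy database, the item names `"x"` and `"y"`, and the helper functions `supp` and `conf` are all hypothetical, and confidence is assumed to be the support of the union divided by the support of the antecedent.

```python
from fractions import Fraction


def supp(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def conf(x, y, transactions):
    """Confidence of the rule x -> y: supp(x union y) / supp(x)."""
    joint = supp(frozenset(x) | frozenset(y), transactions)
    return joint / supp(x, transactions)


# Hypothetical toy database with 10 transactions, built so that
# P(x) = 1/2, P(y) = 4/5 and P(x, y) = 2/5 = P(x) * P(y).
transactions = (
    [frozenset({"x", "y"})] * 4
    + [frozenset({"x"})] * 1
    + [frozenset({"y"})] * 4
    + [frozenset()] * 1
)

p_x = supp({"x"}, transactions)
p_y = supp({"y"}, transactions)
p_xy = supp({"x", "y"}, transactions)

# The itemsets are independent, yet the rule still reaches conf = P(y).
independent = p_x * p_y == p_xy
rule_conf = conf({"x"}, {"y"}, transactions)
```

In this constructed database the rule has a confidence of 4/5 purely because `"y"` is frequent, even though the two itemsets are independent by design.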
Lift (Interest) [7,19]
$$\mathrm{lift}(X \to Y) = \frac{\mathrm{conf}(X \to Y)}{P(Y)} = \frac{\mathrm{conf}(X \to Y)}{\mathrm{supp}(Y)}$$
Lift directly addresses the above problem by expressing the deviation of the rule confidence from $P(Y)$. In the case of stochastic independence, lift = 1 holds. In contrast, a value higher than 1 means that the presence of $X$ in a transaction “lifts” the probability that this transaction also contains $Y$ by the factor lift. The opposite is true for lift values lower than 1. Lift is symmetric, i.e. $\mathrm{lift}(X \to Y) = \mathrm{lift}(Y \to X)$, and is therefore an undirected measure.
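A small sketch may help illustrate the symmetry of lift and its interpretation relative to 1. The toy database, the item names and the helper functions `supp` and `lift` below are hypothetical, and confidence is assumed to be the support of the union divided by the support of the antecedent, so that lift reduces to $P(X, Y) / (P(X) \cdot P(Y))$.

```python
from fractions import Fraction


def supp(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def lift(x, y, transactions):
    """lift(x -> y) = conf(x -> y) / supp(y) = supp(x union y) / (supp(x) * supp(y))."""
    joint = supp(frozenset(x) | frozenset(y), transactions)
    return joint / (supp(x, transactions) * supp(y, transactions))


# Hypothetical toy database with 8 transactions.
transactions = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread"}),
    frozenset({"butter"}),
    frozenset({"milk"}),
    frozenset({"milk"}),
    frozenset({"milk"}),
]

lift_bread_butter = lift({"bread"}, {"butter"}, transactions)  # above 1
lift_butter_bread = lift({"butter"}, {"bread"}, transactions)  # same value
lift_bread_milk = lift({"bread"}, {"milk"}, transactions)      # below 1
```

Because the formula treats both itemsets identically, swapping antecedent and consequent leaves the value unchanged, which is why lift alone cannot indicate the direction of a rule.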
Conviction [7]
$$\mathrm{conv}(X \to Y) = \frac{P(X)\,P(\neg Y)}{P(X, \neg Y)}$$
Let $P(\neg Y)$ be the probability of a transaction $T \in D$ with $Y \not\subseteq T$, and $P(X, \neg Y)$ the probability of drawing a transaction out of $D$ that contains $X$ but not $Y$.
$\mathrm{conv}(X \to Y)$ now expresses to what extent $X$ and $\neg Y$ are stochastically independent; if they are independent, $P(X, \neg Y) = P(X) \cdot P(\neg Y)$ and the value is 1. High values of $\mathrm{conv}(X \to Y)$, up to $\infty$ where $P(X, \neg Y) = 0$, express the conviction that this rule represents a causation. It is important to note that conv is not symmetric and is therefore a directed measure.
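The asymmetry of conviction can be checked on a small example. This is a minimal sketch under stated assumptions: the toy database, the item names `"a"` and `"b"`, and the helpers `supp` and `conviction` are hypothetical; $P(\neg Y)$ is computed as $1 - \mathrm{supp}(Y)$ and $P(X, \neg Y)$ as $\mathrm{supp}(X) - \mathrm{supp}(X \cup Y)$; and the sketch assumes $P(X) > 0$.

```python
import math
from fractions import Fraction


def supp(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def conviction(x, y, transactions):
    """conv(x -> y) = P(x) * P(not y) / P(x, not y); assumes P(x) > 0."""
    p_x = supp(x, transactions)
    p_not_y = 1 - supp(y, transactions)
    # Transactions that contain x but not all of y.
    p_x_not_y = p_x - supp(frozenset(x) | frozenset(y), transactions)
    if p_x_not_y == 0:
        # The rule is never violated in the data.
        return math.inf
    return p_x * p_not_y / p_x_not_y


# Hypothetical toy database: every transaction with "a" also contains "b",
# but not the other way round.
transactions = (
    [frozenset({"a", "b"})] * 3
    + [frozenset({"b"})] * 3
    + [frozenset()] * 2
)

conv_a_b = conviction({"a"}, {"b"}, transactions)  # rule never violated
conv_b_a = conviction({"b"}, {"a"}, transactions)  # rule violated 3 times
```

Here the rule from `"a"` to `"b"` is never contradicted and reaches the upper bound, while the reverse rule yields a finite value, so the direction of the rule matters.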
3 The Process of Knowledge Discovery
Practical experience has shown that discovering knowledge from huge databases
requires much more than simply applying a sophisticated data mining algorithm
to a predefined dataset. In fact, people from research and practice increasingly
understand knowledge discovery in databases (KDD) as