Database Reference
In-Depth Information
score, and are hence potentially interesting, the higher score for abc identifies this
pattern as the most interesting of the two.
In this example the outcome follows our intuition, but in practice this is not always
the case: lift is a rather ad-hoc score. This is due to it comparing the two absolute
values directly, without taking into account how likely these values, or their ratio is,
given the background model.
We can, however, also compare the deviation by performing a proper statistical
test. In order to do so, note that according the independence model the probability of
generating a transaction containing an itemset X is equal to ind ( X ). Assume that our
dataset contains N transactions, and let Z be a random variable stating in how many
transactions X occurs. The probability that Z
=
M is equal to binomial distribution,
N
M
q M (1
q ) N M ,
=
=
=
p ( Z
M )
where
q
ind ( X )
.
Now that we have this probability, we can perform a one-sided statistical test by
computing the probability that we observe a support of fr ( X ) or higher, p ( Z
Nf r ( X )). Note that the larger frX , the smaller the p -value is.
Computing the right-hand side amounts to computing a sum of probabilities
p ( Z
possible values for M, may prove to be re-
strictively slow in practice. However, as exactly in those cases binomial distributions
are accurately approximated by a normal distribution, we can perform an alterna-
tive test by considering a normal approximation of the binomial distribution. In this
case, we can obtain the p -value by computing the tail of the normal distribution
=
M ), which as there are 2 | Z
|
N Nq , Nq (1
q ) , where q
ind ( X ). This is estimate is inaccurate if q is very
close to 0 or 1 and N is small. One rule of thumb is that if Nq > 5 and N (1
=
q ),
then this approximation is fairly accurate.
4.1.1
Beyond Frequency
So far, we only considered comparing the frequency of an itemset against its expected
value. Clearly, we do not have to limit ourselves to only this measure (or, better,
statistic).
Related to fault-tolerant itemsets we saw in Sect. 2 , we can say that an itemset X
is a violation of a transaction t if t does not contain X , X/
t , yet t does contain
some elements from X , t
. We denote the fraction of transactions being
violated by X as v ( X ). The quantity 1
X
=∅
v ( X ) is then a fraction of transactions that
either contain ( X ) or do not contain any items from X . If items are highly correlated
we expect 1
v ( X ) to be high and v ( X )tobelow.
Now, let q be the expected value of v ( X ) based on the independence model. We
can now calculate what Aggarwal and Yu call [ 2 ] the collective strength of a pattern
as follows
1
v ( X )
v ( X )
q
cs ( X )
=
×
.
1
q
1 v ( X )
v ( X )
In other words we compare the ratio of
against the expected value.
Search WWH ::




Custom Search