Interesting Patterns - Frequent Pattern Mining

Database Reference

In-Depth Information

score, and are hence potentially interesting, the higher score for abc identifies this

pattern as the most interesting of the two.

In this example the outcome follows our intuition, but in practice this is not always

the case: lift is a rather ad-hoc score. This is due to it comparing the two absolute

values directly, without taking into account how likely these values, or their ratio is,

given the background model.

We can, however, also compare the deviation by performing a proper statistical

test. In order to do so, note that according the independence model the probability of

generating a transaction containing an itemset X is equal to ind ( X ). Assume that our

dataset contains N transactions, and let Z be a random variable stating in how many

transactions X occurs. The probability that Z

M is equal to binomial distribution,

q M (1

q ) N − M ,

−

p ( Z

M )

where

ind ( X )

Now that we have this probability, we can perform a one-sided statistical test by

computing the probability that we observe a support of fr ( X ) or higher, p ( Z

≥

Nf r ( X )). Note that the larger frX , the smaller the p -value is.

Computing the right-hand side amounts to computing a sum of probabilities

p ( Z

possible values for M, may prove to be re-

strictively slow in practice. However, as exactly in those cases binomial distributions

are accurately approximated by a normal distribution, we can perform an alterna-

tive test by considering a normal approximation of the binomial distribution. In this

case, we can obtain the p -value by computing the tail of the normal distribution

M ), which as there are 2 | Z

N Nq , √ Nq (1

q ) , where q

ind ( X ). This is estimate is inaccurate if q is very

close to 0 or 1 and N is small. One rule of thumb is that if Nq > 5 and N (1

−

q ),

then this approximation is fairly accurate.

4.1.1

Beyond Frequency

So far, we only considered comparing the frequency of an itemset against its expected

value. Clearly, we do not have to limit ourselves to only this measure (or, better,

statistic).

Related to fault-tolerant itemsets we saw in Sect. 2 , we can say that an itemset X

is a violation of a transaction t if t does not contain X , X/

∈

t , yet t does contain

some elements from X , t

. We denote the fraction of transactions being

violated by X as v ( X ). The quantity 1

∩

=∅

v ( X ) is then a fraction of transactions that

either contain ( X ) or do not contain any items from X . If items are highly correlated

we expect 1

−

v ( X ) to be high and v ( X )tobelow.

Now, let q be the expected value of v ( X ) based on the independence model. We

can now calculate what Aggarwal and Yu call [ 2 ] the collective strength of a pattern

as follows

−

v ( X )

cs ( X )

− q

1 − v ( X )

v ( X )

In other words we compare the ratio of

against the expected value.

Frequent Pattern Mining

Search WWH ::

Custom Search

Home