Apart from border cases, Q will typically be very large and may even contain infinitely many distributions. We need to have one distribution, and hence our next step is to choose one from Q. We do this by the Maximum Entropy principle. Formally, we identify this distribution by

    p* = arg max_{p ∈ Q} − Σ_ω p(ω) log p(ω),

where the standard convention 0 × log 0 = 0 is used.

Besides nice information-theoretic properties, the maximum entropy distribution also has many other interesting and practical properties. For instance, it has a very useful regular form [16]. First consider that for some ω ∈ Ω every distribution p ∈ Q has p(ω) = 0. Since p* ∈ Q, this immediately implies that p*(ω) = 0. Let us define Z to be the set of such vectors, Z = {ω | p(ω) = 0 for all p ∈ Q}. The probability of the remaining points Ω \ Z can be expressed as follows: there is a set of numbers r_0, ..., r_M such that

    p*(ω) = exp( r_0 + Σ_{i=1}^{M} r_i S_i(ω) )   for ω ∈ Ω \ Z.     (5.1)

The coefficient r_0 acts as a normalization constant. This form is the well-known log-linear model.
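To make this regular form concrete, the following small sketch (with hypothetical statistics and coefficients, not taken from the text) evaluates a log-linear model of the shape of Eq. 5.1 over all binary vectors of a tiny item universe, choosing r_0 so that the probabilities sum to one.

    # Illustrative only: the statistics S_i and coefficients r_i are made up.
    import itertools, math

    K = 3
    Omega = list(itertools.product([0, 1], repeat=K))

    # Two example statistics: an itemset indicator and a single-item indicator.
    statistics = [lambda w: int(w[0] == 1 and w[2] == 1),   # "contains items 1 and 3"
                  lambda w: w[1]]                           # "contains item 2"
    r = [0.8, -0.4]                                         # hypothetical r_1, r_2

    # r_0 is the normalization constant: pick it so that p* sums to one.
    unnorm = {w: math.exp(sum(ri * S(w) for ri, S in zip(r, statistics))) for w in Omega}
    r0 = -math.log(sum(unnorm.values()))

    p_star = {w: math.exp(r0) * u for w, u in unnorm.items()}
    assert abs(sum(p_star.values()) - 1.0) < 1e-9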
Now that we have established the general form of the maximum entropy model,
let us look at some special cases of background information.
Assume that we do not provide any constraints; then the maximum entropy distribution will be the uniform distribution, p*(ω) = 1/|Ω|. Consider now that we limit ourselves by setting K constraints, one for each item, S_i(ω) = ω_i. Then E[S_i] is the i-th column margin, and we can show by simple manipulation of Eq. 5.1 that p* corresponds to the independence model.
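As a quick sanity check of this claim, the sketch below (with hypothetical margins) verifies numerically that the independence model p(ω) = Π_i θ_i^{ω_i} (1 − θ_i)^{1 − ω_i} has exactly the form of Eq. 5.1 when we take r_i = log(θ_i / (1 − θ_i)) and r_0 = Σ_i log(1 − θ_i).

    # Illustrative check: the column margins theta_i are made-up values.
    import itertools, math

    theta = [0.3, 0.7, 0.1]                        # hypothetical column margins
    r = [math.log(t / (1 - t)) for t in theta]     # r_1, ..., r_K
    r0 = sum(math.log(1 - t) for t in theta)       # normalization constant

    for w in itertools.product([0, 1], repeat=len(theta)):
        independence = math.prod(t if wi else 1 - t for t, wi in zip(theta, w))
        log_linear = math.exp(r0 + sum(ri * wi for ri, wi in zip(r, w)))
        assert abs(independence - log_linear) < 1e-12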
Consider now the other extreme, where we provide 2^K − 1 constraints, one for each non-empty itemset, by setting S_X(ω) to be 1 if ω contains X, and 0 otherwise. We can show using inclusion-exclusion tricks that there is only one distribution in Q. If the corresponding targets θ_X were computed from a dataset D, then p* is equal to the empirical distribution p_emp computed from D, p_emp(ω) = |{t ∈ D | t = ω}| / |D|.
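For completeness, computing this empirical distribution from a binary dataset is straightforward; the sketch below (with a made-up toy dataset) simply counts how often each transaction occurs.

    # Illustrative only: D is a toy dataset of binary transactions.
    from collections import Counter

    D = [(1, 0, 1), (1, 0, 1), (0, 1, 0), (1, 1, 1)]
    counts = Counter(D)
    p_emp = {w: c / len(D) for w, c in counts.items()}   # unseen w get probability 0

    print(p_emp[(1, 0, 1)])   # 0.5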
Consider now that we do not use all itemset constraints. Instead we have a partition P and our itemset constraints consist only of itemsets that are subsets of blocks of P. In this case, items belonging to different blocks are independent under p*; in other words, p* is a partition model. Another example of non-trivial itemset constraints is a set consisting of all singleton itemsets and itemsets of size 2 such that these itemsets form a tree when viewed as edges over the items. In such a case, p* corresponds to the Chow-Liu tree model [13], a special case of a Bayesian network in which each item may have only one parent. As an example of constraints not related to itemsets, consider T_k(t), equal to 1 if and only if t contains k 1s, and 0 otherwise [65]. In such a case, E[T_k] is the probability that a random transaction has k 1s.
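To illustrate these row-margin constraints, the sketch below (again with a made-up toy dataset) estimates E[T_k] as the fraction of transactions containing exactly k 1s.

    # Illustrative only: estimate E[T_k] from a toy dataset.
    from collections import Counter

    D = [(1, 0, 1), (1, 0, 1), (0, 1, 0), (1, 1, 1)]
    row_margins = Counter(sum(t) for t in D)
    E_T = {k: row_margins[k] / len(D) for k in range(len(D[0]) + 1)}

    print(E_T)   # {0: 0.0, 1: 0.25, 2: 0.5, 3: 0.25}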
All the cases we describe above either have a closed form or can be computed efficiently. In general, we can infer the MaxEnt distribution using iterative approaches such as iterative scaling.
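As a rough illustration of such an iterative approach, the brute-force sketch below (it enumerates Ω, so it is feasible only for a small number of items; the statistics, targets, and iteration budget are made up) repeatedly rescales the current distribution so that each expectation E[S_i] matches its target.

    # Brute-force iterative scaling sketch; feasible only for small K.
    import itertools

    K = 3
    Omega = list(itertools.product([0, 1], repeat=K))

    statistics = [lambda w, i=i: w[i] for i in range(K)]   # column-margin constraints
    targets = [0.6, 0.3, 0.5]                              # hypothetical target E[S_i]

    p = {w: 1.0 / len(Omega) for w in Omega}               # start from the uniform distribution

    for _ in range(100):                                   # fixed iteration budget
        for S, theta in zip(statistics, targets):
            current = sum(p[w] for w in Omega if S(w) == 1)
            # Rescale so the mass on {S = 1} becomes theta and on {S = 0} becomes 1 - theta.
            for w in Omega:
                p[w] *= (theta / current) if S(w) == 1 else ((1 - theta) / (1 - current))

    print([round(sum(p[w] for w in Omega if S(w) == 1), 3) for S in statistics])   # [0.6, 0.3, 0.5]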