Apart from border cases, Q will typically be very large and may even contain infinitely many distributions. We need to have one distribution, and hence our next step is to choose one from Q. We do this by the Maximum Entropy principle. Formally, we identify this distribution by

    p* = arg max_{p ∈ Q} − Σ_ω p(ω) log p(ω),

where the standard convention 0 × log 0 = 0 is used.

Besides nice information-theoretic properties, the maximum entropy distribution also has many other interesting and practical properties. For instance, it has a very useful regular form [16]. First consider that for some ω ∈ Ω every distribution p ∈ Q has p(ω) = 0. Since p* ∈ Q, this immediately implies that p*(ω) = 0. Let us define Z to be the set of such vectors, Z = {ω | p(ω) = 0 for all p ∈ Q}. The probability of the remaining points Ω \ Z can be expressed as follows: there is a set of numbers r_0, ..., r_M such that

    p*(ω) = exp( r_0 + Σ_{i=1}^{M} r_i S_i(ω) )   for ω ∈ Ω \ Z.     (5.1)

The coefficient r_0 acts as a normalization constant. This form is the well-known log-linear model.
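To make this regular form concrete, the following small sketch (with hypothetical statistics and coefficients, not taken from the text) evaluates a log-linear model of the shape of Eq. 5.1 over all binary vectors of a tiny item universe, choosing r_0 so that the probabilities sum to one.

    # Illustrative only: the statistics S_i and coefficients r_i are made up.
    import itertools, math

    K = 3
    Omega = list(itertools.product([0, 1], repeat=K))

    # Two example statistics: an itemset indicator and a single-item indicator.
    statistics = [lambda w: int(w[0] == 1 and w[2] == 1),   # "contains items 1 and 3"
                  lambda w: w[1]]                           # "contains item 2"
    r = [0.8, -0.4]                                         # hypothetical r_1, r_2

    # r_0 is the normalization constant: pick it so that p* sums to one.
    unnorm = {w: math.exp(sum(ri * S(w) for ri, S in zip(r, statistics))) for w in Omega}
    r0 = -math.log(sum(unnorm.values()))

    p_star = {w: math.exp(r0) * u for w, u in unnorm.items()}
    assert abs(sum(p_star.values()) - 1.0) < 1e-9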
Now that we have established the general form of the maximum entropy model,
let us look at some special cases of background information.
Assume that we do not provide any constraints; then the maximum entropy distribution will be the uniform distribution, p*(ω) = 1/|Ω|. Consider now that we limit ourselves by setting K constraints, one for each item, S_i(ω) = ω_i. Then E[S_i] is the i-th column margin, and we can show by simple manipulation of Eq. 5.1 that p* corresponds to the independence model.
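As a quick sanity check of this claim, the sketch below (with hypothetical margins) verifies numerically that the independence model p(ω) = Π_i θ_i^{ω_i} (1 − θ_i)^{1 − ω_i} has exactly the form of Eq. 5.1 when we take r_i = log(θ_i / (1 − θ_i)) and r_0 = Σ_i log(1 − θ_i).

    # Illustrative check: the column margins theta_i are made-up values.
    import itertools, math

    theta = [0.3, 0.7, 0.1]                        # hypothetical column margins
    r = [math.log(t / (1 - t)) for t in theta]     # r_1, ..., r_K
    r0 = sum(math.log(1 - t) for t in theta)       # normalization constant

    for w in itertools.product([0, 1], repeat=len(theta)):
        independence = math.prod(t if wi else 1 - t for t, wi in zip(theta, w))
        log_linear = math.exp(r0 + sum(ri * wi for ri, wi in zip(r, w)))
        assert abs(independence - log_linear) < 1e-12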
Consider now the other extreme, where we provide 2^K − 1 constraints, one for each non-empty itemset, by setting S_X(ω) to be 1 if ω contains X, and 0 otherwise. We can show using inclusion-exclusion tricks that there is only one distribution in Q. If the corresponding targets θ_X were computed from a dataset D, then p* is equal to the empirical distribution p_emp computed from D, p_emp(ω) = |{t ∈ D | t = ω}| / |D|.
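For completeness, computing this empirical distribution from a binary dataset is straightforward; the sketch below (with a made-up toy dataset) simply counts how often each transaction occurs.

    # Illustrative only: D is a toy dataset of binary transactions.
    from collections import Counter

    D = [(1, 0, 1), (1, 0, 1), (0, 1, 0), (1, 1, 1)]
    counts = Counter(D)
    p_emp = {w: c / len(D) for w, c in counts.items()}   # unseen w get probability 0

    print(p_emp[(1, 0, 1)])   # 0.5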
Consider now that we do not use all itemset constraints. Instead we have a partition P and our itemset constraints consist only of itemsets that are subsets of blocks of P. In this case, items belonging to different blocks are independent under p*; in other words, p* is a partition model. Another example of non-trivial itemset constraints is a set consisting of all singleton itemsets and itemsets of size 2 such that these itemsets form a tree when viewed as edges over the items. In such a case, p* corresponds to the Chow-Liu tree model [13], a special case of a Bayesian network in which each item may have only one parent. As an example of constraints not related to itemsets, consider T_k(t), equal to 1 if and only if t contains k 1s, and 0 otherwise [65]. In such a case, E[T_k] is the probability that a random transaction has k 1s.
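To illustrate these row-margin constraints, the sketch below (again with a made-up toy dataset) estimates E[T_k] as the fraction of transactions containing exactly k 1s.

    # Illustrative only: estimate E[T_k] from a toy dataset.
    from collections import Counter

    D = [(1, 0, 1), (1, 0, 1), (0, 1, 0), (1, 1, 1)]
    row_margins = Counter(sum(t) for t in D)
    E_T = {k: row_margins[k] / len(D) for k in range(len(D[0]) + 1)}

    print(E_T)   # {0: 0.0, 1: 0.25, 2: 0.5, 3: 0.25}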
All the cases we describe above either have a closed form or can be computed efficiently. In general, we can infer the MaxEnt distribution using iterative approaches such as iterative scaling.
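As a rough illustration of such an iterative approach, the brute-force sketch below (it enumerates Ω, so it is feasible only for a small number of items; the statistics, targets, and iteration budget are made up) repeatedly rescales the current distribution so that each expectation E[S_i] matches its target.

    # Brute-force iterative scaling sketch; feasible only for small K.
    import itertools

    K = 3
    Omega = list(itertools.product([0, 1], repeat=K))

    statistics = [lambda w, i=i: w[i] for i in range(K)]   # column-margin constraints
    targets = [0.6, 0.3, 0.5]                              # hypothetical target E[S_i]

    p = {w: 1.0 / len(Omega) for w in Omega}               # start from the uniform distribution

    for _ in range(100):                                   # fixed iteration budget
        for S, theta in zip(statistics, targets):
            current = sum(p[w] for w in Omega if S(w) == 1)
            # Rescale so the mass on {S = 1} becomes theta and on {S = 0} becomes 1 - theta.
            for w in Omega:
                p[w] *= (theta / current) if S(w) == 1 else ((1 - theta) / (1 - current))

    print([round(sum(p[w] for w in Omega if S(w) == 1), 3) for S in statistics])   # [0.6, 0.3, 0.5]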