under the model. That is, the typical goal in dynamic ranking is to find a ranking of which the top-k is the best explanation of the data in k terms.
To this end we construct a score in a post-hoc fashion. That is, we score a candidate on how much information we would gain if we were to include it in the background knowledge. We do this as follows.
In general, given a set of itemsets F and a target frequency θ_X for every X ∈ F as background knowledge, we can construct a MaxEnt model p such that E[S_X] = θ_X for every X ∈ F. In turn, we can use the likelihood p(D | F) to score F.
In other words, the better p predicts the frequencies of the itemsets in F, the more likely the data is according to p, and hence the higher the quality of F. To be more precise,
we know that p(D | F ∪ {Y}) ≥ p(D | F) and, as a special case, that equality holds whenever the observed frequency of Y is exactly equal to the expectation derived from F. Moreover, the score increases as the observed frequency of Y becomes more distant from the expected value.
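To make the score concrete, here is a minimal sketch of this likelihood-gain computation. It assumes a tiny item universe, so that the MaxEnt model can be fitted by brute-force iterative proportional fitting over all 2^|I| possible transactions; the helper names (fit_maxent, score) are our own illustration, not an implementation from the literature.

```python
from itertools import combinations
from math import log

def powerset(items):
    """All subsets of the item universe, i.e. all possible transactions."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def fit_maxent(universe, constraints, iters=200):
    """Brute-force MaxEnt model over all 2^|universe| transactions with
    E[S_X] = theta_X for each (X, theta_X) constraint, fitted by iterative
    proportional fitting. Only feasible for tiny universes; in general,
    inference is NP-hard, hence the approximations discussed below."""
    space = powerset(universe)
    p = dict.fromkeys(space, 1.0 / len(space))       # start from uniform
    for _ in range(iters):
        for X, theta in constraints:
            e = sum(q for t, q in p.items() if X <= t)   # current E[S_X]
            if not (0.0 < e < 1.0 and 0.0 < theta < 1.0):
                continue
            u = theta * (1 - e) / (e * (1 - theta))      # IPF multiplier
            for t in p:
                if X <= t:
                    p[t] *= u
            z = sum(p.values())
            for t in p:
                p[t] /= z                                # renormalise
    return p

def score(universe, F, data, Y):
    """Likelihood gain log p(D | F ∪ {Y}) - log p(D | F) of candidate Y."""
    freq = lambda X: sum(1 for t in data if X <= t) / len(data)
    base = fit_maxent(universe, [(X, freq(X)) for X in F])
    ext = fit_maxent(universe, [(X, freq(X)) for X in F | {Y}])
    return sum(log(ext[t]) - log(base[t]) for t in data)

# Example: 'a' and 'b' co-occur more often than independence predicts,
# so the candidate {a, b} receives a positive score given F = {{a}, {b}}.
data = [frozenset(t) for t in ('ab', 'ab', 'abc', 'c')]
print(score(frozenset('abc'), {frozenset('a'), frozenset('b')},
            data, frozenset('ab')))
```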
Given this likelihood score we can evaluate how informative a pattern Y is about the data in addition to our background knowledge. The question now is, how can we find good rankings and pattern sets efficiently?
Wang and Parthasarathy [71] take a pre-mined collection of frequent itemsets as candidates, and consider these in level-wise batches. That is, they first consider the itemsets of size 1, then of size 2, and so on. Per batch they select all itemsets whose observed frequency deviates by more than a given threshold (in L1 distance) from the frequency the model predicts, add all of these to the background knowledge, and then update the model and iterate to the next level. To make the ranking feasible, i.e., to get around the NP-hardness of inferring frequencies from the MaxEnt model, the authors sample frequencies instead of inferring them exactly.
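Schematically, the selection loop looks as follows. This is only a sketch of the general strategy described above, not the authors' implementation; the model object and its estimate/add_constraints/refit methods are hypothetical placeholders.

```python
def levelwise_selection(candidates_by_size, data, threshold, model):
    """Level-wise batch selection in the spirit of Wang and Parthasarathy [71].
    `model` is a hypothetical MaxEnt wrapper; estimate() would approximate
    frequencies by sampling, to sidestep NP-hard exact inference."""
    background = []
    for size in sorted(candidates_by_size):            # size 1, then 2, ...
        batch = []
        for X in candidates_by_size[size]:
            observed = sum(1 for t in data if X <= t) / len(data)
            predicted = model.estimate(X)              # sampled, not exact
            if abs(observed - predicted) > threshold:  # L1 deviation test
                batch.append((X, observed))
        model.add_constraints(batch)        # extend the background knowledge
        model.refit()                       # update the MaxEnt model
        background.extend(batch)
    return background
```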
Alternatively, Mampaey et al. [42] iteratively mine the single most informative pattern, regardless of its cardinality. To do so efficiently, they propose a convex bound which allows many candidate patterns to be pruned, as well as a method for inferring the MaxEnt model more efficiently using a quick inclusion/exclusion based approach. NP-hardness is circumvented here by partitioning the model, either explicitly such that only patterns of length up to k are allowed, or by allowing up to k overlapping itemsets per part.
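The resulting search can be sketched as a simple iterate-and-prune loop. Again, this is a schematic illustration of the strategy rather than the algorithm from [42]; bound() and score() are hypothetical method names standing in for the convex bound and the exact likelihood gain.

```python
def iterative_top_k(candidates, model, k):
    """Iteratively add the single most informative pattern, in the spirit of
    Mampaey et al. [42]. `model.bound(X)` is a cheap upper bound on the score
    of X; `model.score(X)` is the exact (expensive) likelihood gain."""
    chosen = []
    for _ in range(k):
        best, best_score = None, 0.0
        for X in candidates:
            if model.bound(X) <= best_score:   # the bound cannot beat the
                continue                       # current best: prune X cheaply
            s = model.score(X)
            if s > best_score:
                best, best_score = X, s
        if best is None:                       # nothing informative remains
            break
        model.add_constraint(best)         # absorb into background knowledge
        model.refit()
        candidates.remove(best)
        chosen.append(best)
    return chosen
```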
5.3 Tile-based Techniques
While expressing redundancy with itemsets and generative models can be very complicated, as we have to somehow determine expected frequencies of itemsets given frequencies of other itemsets, tiles provide much more straightforward and natural ways of measuring redundancy. For instance, we can consider the overlap between tiles.
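Because a tile is simply a set of rows times a set of columns, its overlap with another tile is again a tile, and can be computed directly. A minimal sketch, with tiles represented as (rows, columns) pairs of sets:

```python
def tile_area(tile):
    """Number of cells covered by a tile, given as (rows, columns)."""
    rows, cols = tile
    return len(rows) * len(cols)

def tile_overlap(t1, t2):
    """Cells shared by two tiles: the Cartesian product of the row
    intersection and the column intersection is itself a tile."""
    (r1, c1), (r2, c2) = t1, t2
    return len(r1 & r2) * len(c1 & c2)

# Two 6-cell tiles in a 0-indexed binary matrix:
t1 = ({0, 1, 2}, {0, 1})
t2 = ({1, 2, 3}, {1, 2})
print(tile_overlap(t1, t2))  # 2 -- the shared cells (1,1) and (2,1)
```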
As a basic example of such a problem, consider the large tile mining problem we encountered in Sect. 2, where the goal is to find all exact tiles covering at least minarea 1s. When we cast this in the dynamic ranking framework, we would want