under the model. That is, the typical goal in dynamic ranking is to find a ranking of which the top-k is the best explanation of the data in k terms.
To this end we construct a score in a post-hoc fashion. That is, we score a candidate on how much information we would gain if we were to include it in the background knowledge. We do this as follows.
In general, given a set of itemsets F and a target frequency θ_X for every X ∈ F as background knowledge, we can construct a MaxEnt model p such that E[S_X] = θ_X for every X ∈ F. In turn, we can use the likelihood p(D | F) to score F.
In other words, the better p predicts the frequencies of the itemsets in F, the more likely the data is according to p, and hence the higher the quality of F. To be more precise,
we know that p(D | F ∪ {Y}) ≥ p(D | F) and, as a special case, that equality holds whenever the observed frequency of Y is exactly equal to the expectation derived from F. Moreover, the score increases as the observed frequency of Y becomes more distant from the expected value.
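To make the score concrete, here is a minimal sketch of this likelihood-gain computation. It assumes a tiny item universe, so that the MaxEnt model can be fitted by brute-force iterative proportional fitting over all 2^|I| possible transactions; the helper names (fit_maxent, score) are our own illustration, not an implementation from the literature.

```python
from itertools import combinations
from math import log

def powerset(items):
    """All subsets of the item universe, i.e. all possible transactions."""
    items = sorted(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def fit_maxent(universe, constraints, iters=200):
    """Brute-force MaxEnt model over all 2^|universe| transactions with
    E[S_X] = theta_X for each (X, theta_X) constraint, fitted by iterative
    proportional fitting. Only feasible for tiny universes; in general,
    inference is NP-hard, hence the approximations discussed below."""
    space = powerset(universe)
    p = dict.fromkeys(space, 1.0 / len(space))       # start from uniform
    for _ in range(iters):
        for X, theta in constraints:
            e = sum(q for t, q in p.items() if X <= t)   # current E[S_X]
            if not (0.0 < e < 1.0 and 0.0 < theta < 1.0):
                continue
            u = theta * (1 - e) / (e * (1 - theta))      # IPF multiplier
            for t in p:
                if X <= t:
                    p[t] *= u
            z = sum(p.values())
            for t in p:
                p[t] /= z                                # renormalise
    return p

def score(universe, F, data, Y):
    """Likelihood gain log p(D | F ∪ {Y}) - log p(D | F) of candidate Y."""
    freq = lambda X: sum(1 for t in data if X <= t) / len(data)
    base = fit_maxent(universe, [(X, freq(X)) for X in F])
    ext = fit_maxent(universe, [(X, freq(X)) for X in F | {Y}])
    return sum(log(ext[t]) - log(base[t]) for t in data)

# Example: 'a' and 'b' co-occur more often than independence predicts,
# so the candidate {a, b} receives a positive score given F = {{a}, {b}}.
data = [frozenset(t) for t in ('ab', 'ab', 'abc', 'c')]
print(score(frozenset('abc'), {frozenset('a'), frozenset('b')},
            data, frozenset('ab')))
```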
Given this likelihood score we can evaluate how informative a pattern Y is about the data in addition to our background knowledge. The question now is, how can we find good rankings and pattern sets efficiently?
Wang and Parthasarathy [71] take a pre-mined collection of frequent itemsets as candidates, and consider these in level-wise batches. That is, they first consider the itemsets of size 1, then of size 2, and so on. Per batch they select all itemsets whose observed frequency deviates by more than a given threshold (in L1 distance) from the frequency the model predicts, add all of these to the background knowledge, and then update the model and iterate to the next level. To make the ranking feasible, i.e., to get around the NP-hardness of inferring frequencies from the MaxEnt model, the authors sample frequencies instead of inferring them exactly.
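Schematically, the selection loop looks as follows. This is only a sketch of the general strategy described above, not the authors' implementation; the model object and its estimate/add_constraints/refit methods are hypothetical placeholders.

```python
def levelwise_selection(candidates_by_size, data, threshold, model):
    """Level-wise batch selection in the spirit of Wang and Parthasarathy [71].
    `model` is a hypothetical MaxEnt wrapper; estimate() would approximate
    frequencies by sampling, to sidestep NP-hard exact inference."""
    background = []
    for size in sorted(candidates_by_size):            # size 1, then 2, ...
        batch = []
        for X in candidates_by_size[size]:
            observed = sum(1 for t in data if X <= t) / len(data)
            predicted = model.estimate(X)              # sampled, not exact
            if abs(observed - predicted) > threshold:  # L1 deviation test
                batch.append((X, observed))
        model.add_constraints(batch)        # extend the background knowledge
        model.refit()                       # update the MaxEnt model
        background.extend(batch)
    return background
```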
Alternatively, Mampaey et al. [42] iteratively mine the single most informative pattern, regardless of its cardinality. To do so efficiently, they propose a convex bound which allows many candidate patterns to be pruned, as well as a method for inferring the MaxEnt model more efficiently using a quick inclusion/exclusion based approach. NP-hardness is circumvented here by partitioning the model, either explicitly such that only patterns of length up to k are allowed, or by allowing up to k overlapping itemsets per part.
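The resulting search can be sketched as a simple iterate-and-prune loop. Again, this is a schematic illustration of the strategy rather than the algorithm from [42]; bound() and score() are hypothetical method names standing in for the convex bound and the exact likelihood gain.

```python
def iterative_top_k(candidates, model, k):
    """Iteratively add the single most informative pattern, in the spirit of
    Mampaey et al. [42]. `model.bound(X)` is a cheap upper bound on the score
    of X; `model.score(X)` is the exact (expensive) likelihood gain."""
    chosen = []
    for _ in range(k):
        best, best_score = None, 0.0
        for X in candidates:
            if model.bound(X) <= best_score:   # the bound cannot beat the
                continue                       # current best: prune X cheaply
            s = model.score(X)
            if s > best_score:
                best, best_score = X, s
        if best is None:                       # nothing informative remains
            break
        model.add_constraint(best)         # absorb into background knowledge
        model.refit()
        candidates.remove(best)
        chosen.append(best)
    return chosen
```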
5.3 Tile-based Techniques
While expressing redundancy with itemsets and generative models can be very complicated, as we have to somehow determine expected frequencies of itemsets given frequencies of other itemsets, tiles provide much more straightforward and natural ways of measuring redundancy. For instance, we can consider the overlap between tiles.
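Because a tile is simply a set of rows times a set of columns, its overlap with another tile is again a tile, and can be computed directly. A minimal sketch, with tiles represented as (rows, columns) pairs of sets:

```python
def tile_area(tile):
    """Number of cells covered by a tile, given as (rows, columns)."""
    rows, cols = tile
    return len(rows) * len(cols)

def tile_overlap(t1, t2):
    """Cells shared by two tiles: the Cartesian product of the row
    intersection and the column intersection is itself a tile."""
    (r1, c1), (r2, c2) = t1, t2
    return len(r1 & r2) * len(c1 & c2)

# Two 6-cell tiles in a 0-indexed binary matrix:
t1 = ({0, 1, 2}, {0, 1})
t2 = ({1, 2, 3}, {1, 2})
print(tile_overlap(t1, t2))  # 2 -- the shared cells (1,1) and (2,1)
```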
As a basic example of such a problem, consider the large tile mining problem we encountered in Sect. 2, where the goal is to find all exact tiles covering at least minarea 1s. When we cast this in the dynamic ranking framework, we would want