5.4 Swap Randomization
As the last model we consider for dynamic ranking, we return to the swap randomization model we first described in Sect. 4. Recall that using this model we can sample random datasets with fixed row and column margins, and that we can use these samples to obtain empirical p-values.
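To make this concrete, the sketch below shows the basic swap operation, which exchanges two cells of the binary data so that all row and column margins stay fixed, and how sampled datasets yield an empirical p-value for some statistic of interest. The representation of the data as a set of (row, item) pairs, the function names, and the number of swap attempts are illustrative assumptions rather than details taken from the text.

```python
import random

def swap_randomize(cells, num_attempts, rng=random):
    """Sample a random dataset with the same row and column margins.

    `cells` is a set of (row, item) pairs encoding a 0/1 dataset.
    A swap replaces (r1, i1), (r2, i2) by (r1, i2), (r2, i1); this
    keeps every row and column sum unchanged, and is only applied
    when the two target cells are currently empty.
    """
    cells = set(cells)
    order = list(cells)                       # positions for uniform sampling
    index = {c: k for k, c in enumerate(order)}
    for _ in range(num_attempts):
        (r1, i1), (r2, i2) = rng.sample(order, 2)
        if r1 == r2 or i1 == i2:
            continue                          # swap would change nothing
        if (r1, i2) in cells or (r2, i1) in cells:
            continue                          # target cells already occupied
        for old, new in (((r1, i1), (r1, i2)), ((r2, i2), (r2, i1))):
            k = index.pop(old)
            cells.discard(old)
            cells.add(new)
            order[k] = new
            index[new] = k
    return cells

def empirical_p_value(observed, sampled_statistics):
    """One-sided empirical p-value: the fraction of sampled datasets whose
    statistic is at least as extreme as the observed one (with the usual
    +1 correction so the p-value is never exactly zero)."""
    exceed = sum(1 for s in sampled_statistics if s >= observed)
    return (exceed + 1) / (len(sampled_statistics) + 1)
```

In practice one would draw many independent samples (each after a long burn-in of swap attempts), compute the statistic of interest, e.g., the support of a candidate itemset, on each sample, and compare these values against the one observed in the real data.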
We can extend this model by requiring that the sampled datasets must also have fixed frequencies for a certain given set of itemsets. Unsurprisingly, however, this makes sampling datasets very difficult. In fact, producing a new dataset satisfying all the constraints is computationally intractable in general, even if we have the original dataset at our disposal [29].
Instead of enforcing hard constraints, we can relax these conditions and require that the probability of a random dataset R decreases as the frequencies of the given itemsets diverge from their target frequencies. Datasets that match the given frequencies exactly have the largest probability, but other datasets are also possible. This relaxation allows us to use the same well-understood MCMC techniques as for standard swap randomization [29].
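A rough sketch of such a relaxed sampler is given below: each proposed margin-preserving swap is accepted with a Metropolis ratio that penalizes datasets whose itemset frequencies drift away from the targets. The specific penalty (a sum of absolute deviations), the scale parameter lam, and the from-scratch recomputation of supports are simplifying assumptions made only for illustration.

```python
import math
import random

def frequency_divergence(cells, itemsets, targets):
    """Sum of absolute differences between current and target supports.
    `itemsets` is a list of frozensets of items, `targets` the desired
    support (row count) for each of them."""
    rows = {}
    for r, i in cells:
        rows.setdefault(r, set()).add(i)
    return sum(abs(sum(1 for items in rows.values() if X <= items) - t)
               for X, t in zip(itemsets, targets))

def constrained_swap_step(cells, itemsets, targets, lam=1.0, rng=random):
    """One Metropolis step: propose a margin-preserving swap and accept it
    with probability min(1, exp(-lam * (new_divergence - old_divergence)))."""
    (r1, i1), (r2, i2) = rng.sample(list(cells), 2)
    if r1 == r2 or i1 == i2:
        return cells
    if (r1, i2) in cells or (r2, i1) in cells:
        return cells
    proposal = (cells - {(r1, i1), (r2, i2)}) | {(r1, i2), (r2, i1)}
    old_div = frequency_divergence(cells, itemsets, targets)
    new_div = frequency_divergence(proposal, itemsets, targets)
    if new_div <= old_div or rng.random() < math.exp(-lam * (new_div - old_div)):
        return proposal
    return cells
```

Running many such steps yields random datasets whose margins are exactly preserved and whose constrained itemset frequencies are close to, but not necessarily equal to, the targets.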
6 Pattern Sets
Pattern set mining is the fourth and final approach to discovering interesting patterns
that we cover, and is also the most recent. It is strongly related to the dynamic
modeling approach we met in the previous section, but has a slightly different twist
and implications.
The general idea in pattern set mining is simple: instead of measuring the interestingness of each pattern X individually, i.e., through q(X), we now define q over sets of patterns 𝒳. That is, instead of evaluating a pattern X only locally, e.g., checking whether it describes significant local structure of the data, the goal is now defined globally. As such, we aim to find that set of patterns 𝒳 that is optimal with regard to q. For example, we can now say we want to find the set of patterns that together describes the structure of the data best, i.e., that models the full joint distribution of the data best.
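As a purely illustrative example of a set-level score (this particular choice is an assumption, not one of the measures discussed here), one could reward a pattern set for how much of the data its patterns jointly cover, so that a pattern is only valuable for what it adds on top of the rest of the set.

```python
def q_coverage(pattern_set, rows):
    """Toy set-level quality: fraction of item occurrences covered by at
    least one pattern in the set. `rows` is a list of item-sets (one per
    transaction); a pattern (a frozenset of items) covers its items in
    every row that contains it."""
    total = sum(len(items) for items in rows)
    covered = 0
    for items in rows:
        hit = set()
        for X in pattern_set:
            if X <= items:        # the pattern occurs in this row
                hit |= X          # so it covers its items here
        covered += len(hit)
    return covered / total if total else 0.0
```

Under such a score, a pattern that is individually very frequent but whose items are already covered by previously chosen patterns contributes almost nothing, which is something a per-pattern measure q(X) cannot express.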
By measuring quality over sets instead of individual patterns, we face a combinatorial problem over an exponential space of candidate elements. That is, to find the optimal solution naively we would have to consider every possible subset out of the space of 2^|I| possible patterns in the data. Sadly, none of the proposed quality measures for pattern set mining exhibit monotonicity, and hence we have no efficient strategy to obtain the optimal pattern set. Moreover, while for some measures we know that even approximating the optimum within any constant factor is NP-hard [49], for most measures the score landscape is so erratic that we so far have no results at all on the complexity of the optimization problem.
In light of the size of the search space and the difficulty of the problem, most pattern set mining methods employ heuristics. In particular, the locally optimal greedy strategy, which iteratively adds to 𝒳 the pattern that improves q the most, is a common choice, as sketched below.
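A minimal sketch of such a greedy search is shown here; it works with any set-level score q(pattern_set, rows), for instance the toy coverage score above. The candidate pool, stopping rule, and size limit are illustrative assumptions.

```python
def greedy_pattern_set(candidates, rows, q, max_patterns=10):
    """Greedy heuristic: repeatedly add the candidate pattern that
    increases the set-level score q the most, and stop as soon as no
    candidate improves the score (or the size limit is reached)."""
    selected = []
    best_score = q(selected, rows)
    remaining = list(candidates)
    while remaining and len(selected) < max_patterns:
        score, best = max(((q(selected + [X], rows), X) for X in remaining),
                          key=lambda pair: pair[0])
        if score <= best_score:   # local optimum: nothing improves q
            break
        selected.append(best)
        remaining.remove(best)
        best_score = score
    return selected
```

For example, greedy_pattern_set(frequent_itemsets, rows, q_coverage) would skip itemsets that are individually frequent yet redundant given what has already been selected, which is exactly the global behaviour described above.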