5.4 Swap Randomization
As the last model we consider for dynamic ranking, we return to the swap randomization model we first described in Sect. 4. Recall that using this model we can sample random datasets with fixed row and column margins, and that we can use these samples to obtain empirical p-values.
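To make this concrete, the sketch below shows the basic swap operation, which exchanges two cells of the binary data so that all row and column margins stay fixed, and how sampled datasets yield an empirical p-value for some statistic of interest. The representation of the data as a set of (row, item) pairs, the function names, and the number of swap attempts are illustrative assumptions rather than details taken from the text.

```python
import random

def swap_randomize(cells, num_attempts, rng=random):
    """Sample a random dataset with the same row and column margins.

    `cells` is a set of (row, item) pairs encoding a 0/1 dataset.
    A swap replaces (r1, i1), (r2, i2) by (r1, i2), (r2, i1); this
    keeps every row and column sum unchanged, and is only applied
    when the two target cells are currently empty.
    """
    cells = set(cells)
    order = list(cells)                       # positions for uniform sampling
    index = {c: k for k, c in enumerate(order)}
    for _ in range(num_attempts):
        (r1, i1), (r2, i2) = rng.sample(order, 2)
        if r1 == r2 or i1 == i2:
            continue                          # swap would change nothing
        if (r1, i2) in cells or (r2, i1) in cells:
            continue                          # target cells already occupied
        for old, new in (((r1, i1), (r1, i2)), ((r2, i2), (r2, i1))):
            k = index.pop(old)
            cells.discard(old)
            cells.add(new)
            order[k] = new
            index[new] = k
    return cells

def empirical_p_value(observed, sampled_statistics):
    """One-sided empirical p-value: the fraction of sampled datasets whose
    statistic is at least as extreme as the observed one (with the usual
    +1 correction so the p-value is never exactly zero)."""
    exceed = sum(1 for s in sampled_statistics if s >= observed)
    return (exceed + 1) / (len(sampled_statistics) + 1)
```

In practice one would draw many independent samples (each after a long burn-in of swap attempts), compute the statistic of interest, e.g., the support of a candidate itemset, on each sample, and compare these values against the one observed in the real data.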
We can extend this model by requiring that the sampled datasets must also have fixed frequencies for a certain given set of itemsets. Unsurprisingly, however, this makes sampling datasets very difficult. In fact, producing a new dataset satisfying all the constraints is computationally intractable in general, even if we have the original dataset at our disposal [29].
Instead of enforcing hard constraints, we can relax these conditions and require that the probability of a random dataset R decreases as the frequencies of the given itemsets diverge from their target frequencies. Datasets that match the given frequencies exactly have the largest probability, but other datasets are also possible. This relaxation allows us to use the same well-understood MCMC techniques as for standard swap randomization [29].
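A rough sketch of such a relaxed sampler is given below: each proposed margin-preserving swap is accepted with a Metropolis ratio that penalizes datasets whose itemset frequencies drift away from the targets. The specific penalty (a sum of absolute deviations), the scale parameter lam, and the from-scratch recomputation of supports are simplifying assumptions made only for illustration.

```python
import math
import random

def frequency_divergence(cells, itemsets, targets):
    """Sum of absolute differences between current and target supports.
    `itemsets` is a list of frozensets of items, `targets` the desired
    support (row count) for each of them."""
    rows = {}
    for r, i in cells:
        rows.setdefault(r, set()).add(i)
    return sum(abs(sum(1 for items in rows.values() if X <= items) - t)
               for X, t in zip(itemsets, targets))

def constrained_swap_step(cells, itemsets, targets, lam=1.0, rng=random):
    """One Metropolis step: propose a margin-preserving swap and accept it
    with probability min(1, exp(-lam * (new_divergence - old_divergence)))."""
    (r1, i1), (r2, i2) = rng.sample(list(cells), 2)
    if r1 == r2 or i1 == i2:
        return cells
    if (r1, i2) in cells or (r2, i1) in cells:
        return cells
    proposal = (cells - {(r1, i1), (r2, i2)}) | {(r1, i2), (r2, i1)}
    old_div = frequency_divergence(cells, itemsets, targets)
    new_div = frequency_divergence(proposal, itemsets, targets)
    if new_div <= old_div or rng.random() < math.exp(-lam * (new_div - old_div)):
        return proposal
    return cells
```

Running many such steps yields random datasets whose margins are exactly preserved and whose constrained itemset frequencies are close to, but not necessarily equal to, the targets.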
6 Pattern Sets
Pattern set mining is the fourth and final approach to discovering interesting patterns
that we cover, and is also the most recent. It is strongly related to the dynamic
modeling approach we met in the previous section, but has a slightly different twist
and implications.
The general idea in pattern set mining is simple: instead of measuring the interestingness of each pattern X individually, i.e., through q(X), we now define q over sets of patterns 𝒳. That is, instead of evaluating a pattern X only locally, e.g., checking whether it describes significant local structure of the data, the goal is now defined globally. As such, we aim to find that set of patterns 𝒳 that is optimal with regard to q. For example, we can now say we want to find the set of patterns that together describes the structure of the data best, i.e., that models the full joint distribution of the data best.
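As a purely illustrative example of a set-level score (this particular choice is an assumption, not one of the measures discussed here), one could reward a pattern set for how much of the data its patterns jointly cover, so that a pattern is only valuable for what it adds on top of the rest of the set.

```python
def q_coverage(pattern_set, rows):
    """Toy set-level quality: fraction of item occurrences covered by at
    least one pattern in the set. `rows` is a list of item-sets (one per
    transaction); a pattern (a frozenset of items) covers its items in
    every row that contains it."""
    total = sum(len(items) for items in rows)
    covered = 0
    for items in rows:
        hit = set()
        for X in pattern_set:
            if X <= items:        # the pattern occurs in this row
                hit |= X          # so it covers its items here
        covered += len(hit)
    return covered / total if total else 0.0
```

Under such a score, a pattern that is individually very frequent but whose items are already covered by previously chosen patterns contributes almost nothing, which is something a per-pattern measure q(X) cannot express.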
By measuring quality over sets instead of individual patterns, we face a combinatorial problem over an exponential space of candidate elements. That is, to find the optimal solution naively we would have to consider every possible subset out of the space of 2^|I| possible patterns in the data. Sadly, none of the proposed quality measures for pattern set mining exhibit monotonicity, and hence we have no efficient strategy to obtain the optimal pattern set. Moreover, while for some measures we know that even approximating the optimum within any constant factor is NP-hard [49], for most measures the score landscape is so erratic that we so far have no results at all on the complexity of the optimization problem.
In light of the size of the search space and the difficulty of the problem, most pattern set mining methods employ heuristics. In particular, the locally optimal greedy strategy, which iteratively adds to 𝒳 the pattern that improves q the most, is a common choice, as sketched below.
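A minimal sketch of such a greedy search is shown here; it works with any set-level score q(pattern_set, rows), for instance the toy coverage score above. The candidate pool, stopping rule, and size limit are illustrative assumptions.

```python
def greedy_pattern_set(candidates, rows, q, max_patterns=10):
    """Greedy heuristic: repeatedly add the candidate pattern that
    increases the set-level score q the most, and stop as soon as no
    candidate improves the score (or the size limit is reached)."""
    selected = []
    best_score = q(selected, rows)
    remaining = list(candidates)
    while remaining and len(selected) < max_patterns:
        score, best = max(((q(selected + [X], rows), X) for X in remaining),
                          key=lambda pair: pair[0])
        if score <= best_score:   # local optimum: nothing improves q
            break
        selected.append(best)
        remaining.remove(best)
        best_score = score
    return selected
```

For example, greedy_pattern_set(frequent_itemsets, rows, q_coverage) would skip itemsets that are individually frequent yet redundant given what has already been selected, which is exactly the global behaviour described above.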