Database Reference
In-Depth Information
data in terms of a global p -value. By employing the swap randomization framework,
and in particular by using empirical p -values they can consider a rich range of
patterns, including frequent itemsets as well as clusterings. The goal of finding the
smallest set that statistically explains the data is NP-hard in its general form, and
there exists no efficient algorithm with finite approximation ratio.
As many random datasets need to be sampled in order to determine significance
this approach is not as computationally efficient as some of the methods covered
above, however, it should be noted that the framework is highly general. In principle
it can be applied to any data and pattern type for which a (swap-)randomization variant
has been defined, which includes, among others, real-valued data [ 51 ], graphs [ 28 ],
and sequences [ 32 ].
7
Conclusions
With the goal of mining interesting patterns we face two main problems: first, we
need to be able to identify whether a pattern is potentially interesting, and second, we
do not want the results to be redundant. Both these concepts are subjective, and hence
there is no single correct answer to either of the two goals. Consequently, we provide
an overview of myriad of different techniques for mining interesting patterns. These
techniques range from classic reduction approaches, such as, closed itemsets and
non-derivable itemsets to statistical methods, where we either are looking patterns
that deviate most from the expectation or looking for a compact set that models the
data well.
Unlike when mining frequent itemsets, for more advanced interestingness mea-
sures we rarely have monotonicity to our advantage. This means that we cannot
prune the search space easily, and mining algorithms are hence significantly more
complex. In particular algorithms for discovering pattern sets are often heuristic.
Better understanding of these combinatorial problems and their score, for example,
by providing theoretical guarantees, is both a promising and necessary line of work
toward developing better and faster algorithms.
In this chapter we covered techniques meant only for binary data. In comparison,
discovering and defining interesting patterns and pattern sets from other types of
data, such as, sequences, graphs, or real-valued datasets is still strongly under-
developed. Both in the definition of useful interestingness measures, as well as in the
development of efficient algorithms for extracting such patterns directly from data
there exist many opportunities for exciting future work.
Regardless of data type, the key idea for future work is developing algorithms for
mining small and non-redundant sets of only the most interesting patterns. Or, to
quote Toon Calders at the 2012 ECMLPKDD most-influential paper award: “please,
please stop making new algorithms for mining all patterns”.
Search WWH ::




Custom Search