The goal is to find an ordering of tiles such that each top-k prefix covers as many 1s as possible using k exact tiles. That is, every tile in the ranking has to cover as many uncovered 1s as possible: any 1s the k-th tile covers that are already covered by a tile of rank smaller than k are simply not counted, and hence redundancy is directly punished. Geerts et al. [21] call this the maximal tiling problem, and identify it as an instance of the set cover problem, which is known to be NP-hard [36].
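To make the greedy strategy concrete, the sketch below ranks tiles one by one, counting for each candidate only the 1s not yet covered by a higher-ranked tile. This is a minimal illustration rather than the algorithm of Geerts et al.; the tile representation as (rows, cols) index pairs and the pre-enumerated candidate list are assumptions.

```python
import numpy as np

def greedy_k_tiling(D, candidate_tiles, k):
    """Greedily rank up to k tiles, each covering as many yet-uncovered
    1s as possible. `candidate_tiles` is a hypothetical list of
    (rows, cols) index pairs, each defining an exact (all-1s) tile."""
    D = D.astype(bool)
    covered = np.zeros_like(D)   # boolean mask of already-covered cells
    ranking = []
    for _ in range(k):
        best, best_gain = None, 0
        for rows, cols in candidate_tiles:
            idx = np.ix_(rows, cols)
            # gain: 1s in this tile not covered by a higher-ranked tile
            gain = int((D[idx] & ~covered[idx]).sum())
            if gain > best_gain:
                best, best_gain = (rows, cols), gain
        if best is None:         # no tile adds uncovered 1s; stop early
            break
        covered[np.ix_(*best)] = True
        ranking.append(best)
    return ranking
```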
For noisy tiles, overlap alone does not suffice to identify redundancy, as different tiles may explain areas of the data in finer or coarser detail. We should therefore also consider the quality of a tile, for example by penalizing noise, or by favoring tiles that are surprising.
To this end we can re-use the Rasch model [18], now using it to discover surprising tile sets. Similar to ranking tiles based on area, we can rank tilings by computing the likelihood of the entries covered by the tile set [37]. As with ranking individual tiles, we need to normalize this probability, as otherwise the best tiling is automatically a single tile containing the whole dataset. Kontonasios and De Bie [37] do not explicitly update their model, but instead consider the case where the analyst investigates every reported tile exactly. As such, the values (0s and 1s) of a processed tile can be assumed known, and hence the likelihoods of already covered cells are set to 1. They further show this covering problem is an instance of Weighted Budgeted Maximum Set Cover, which is NP-hard, but for which the greedy approach is known to provide good solutions.
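A minimal sketch of this greedy covering scheme follows, assuming the model cell probabilities P (e.g. from a fitted Rasch model) are given, and assuming each tile's cost is simply its area; both are illustrative simplifications, not the exact setup of Kontonasios and De Bie. Covered cells have their likelihood set to 1, so they contribute no further information gain.

```python
import numpy as np

def greedy_surprising_tiles(D, P, candidate_tiles, budget):
    """Greedy Weighted Budgeted Maximum Set Cover over tiles.
    P[i, j] is the model probability that cell (i, j) equals 1.
    Hypothetical cost model: a tile costs its area; the budget caps
    the total cost of selected tiles."""
    D = D.astype(bool)
    # per-cell information content under the model, in bits
    info = np.where(D, -np.log2(P), -np.log2(1.0 - P))
    covered = np.zeros_like(D)
    chosen, spent = [], 0
    while spent < budget:
        best, best_ratio, best_cost = None, 0.0, 0
        for rows, cols in candidate_tiles:
            idx = np.ix_(rows, cols)
            # covered cells are "known" (likelihood 1): zero extra gain
            gain = info[idx][~covered[idx]].sum()
            cost = len(rows) * len(cols)
            if spent + cost <= budget and gain / cost > best_ratio:
                best, best_ratio, best_cost = (rows, cols), gain / cost, cost
        if best is None:
            break
        covered[np.ix_(*best)] = True
        chosen.append(best)
        spent += best_cost
    return chosen
```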
The MaxEnt model composed from tiles can also be used in a similar manner as the MaxEnt model built from itemsets. For instance, given a set of tiles and their densities, that is, the fraction of 1s inside each tile, we can construct the corresponding MaxEnt model and use the likelihood of the data as the goodness of the tile set [67]. Besides ranking a set of candidate tiles, Tatti and Vreeken show that many (exploratory) data mining results on binary data can be translated into sets of noisy tiles. Hence, through the same machinery, we can dynamically rank results from different algorithms based on their relative informativeness [67].
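The sketch below fits such a tile-based MaxEnt model. The factorization into independent Bernoulli cells, whose log-odds sum the multipliers of the tiles covering them, is the standard form for density constraints; the plain gradient-ascent fitting loop and all parameter names are assumptions for illustration (the cited work uses iterative scaling).

```python
import numpy as np

def maxent_from_tiles(shape, tiles, densities, iters=500, lr=0.5):
    """Fit a MaxEnt model over a binary matrix whose constraints are
    tile densities (fraction of 1s per tile). Each cell's probability
    is sigmoid(sum of multipliers of tiles covering it); multipliers
    are fitted by simple gradient ascent (assumed optimizer)."""
    lam = np.zeros(len(tiles))
    masks = []
    for rows, cols in tiles:
        m = np.zeros(shape, dtype=bool)
        m[np.ix_(rows, cols)] = True
        masks.append(m)
    for _ in range(iters):
        theta = sum(l * m for l, m in zip(lam, masks))
        p = 1.0 / (1.0 + np.exp(-theta))
        for t, m in enumerate(masks):
            # gradient: target density minus expected density in tile t
            lam[t] += lr * (densities[t] - p[m].mean())
    return p  # fitted cell probabilities

def log_likelihood(D, p):
    """Goodness of the tile set: log-likelihood of the data D
    under the fitted MaxEnt model."""
    D = D.astype(bool)
    return np.where(D, np.log(p), np.log(1.0 - p)).sum()
```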
Whereas the basic MaxEnt model allows only frequencies of 1s within tiles as background knowledge, a recent paper by Kontonasios and De Bie [38] demonstrates how more complex information can be incorporated. Examples include frequencies of itemsets; unfortunately, however, as we saw for the transaction-based MaxEnt model, this does mean that inference from the model becomes NP-hard in general.
In the introduction we spoke about the impossibility of formalizing the inherently
subjective notion of interestingness in general. Conceptually, however, dynamic
ranking comes very close. As long as we can infer the Maximum Entropy model for
the background knowledge an arbitrary user has, under the assumption that the user
can optimally make all inferences from this knowledge, we know from information
theory that our framework will correctly identify the most surprising result. De Bie
[ 17 ] argues that this setup is one of the most promising for measuring subjective
interestingness. The key challenges he identifies are in defining Maximum Entropy
models for rich data and pattern types, as well as for efficiently mining those patterns
that optimize the score.