The goal is to find an ordering of tiles such that each top-k prefix covers as many 1s as possible using k exact tiles. That is, every tile in the ranking has to cover as many uncovered 1s as possible: any 1s the k-th tile covers that are already covered by a tile of rank smaller than k are simply not counted, and hence redundancy is directly punished. Geerts et al. [21] call this the maximal tiling problem, and identify it as an instance of the set cover problem, which is known to be NP-hard [36].
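To make the greedy strategy concrete, the sketch below ranks tiles one by one, counting for each candidate only the 1s not yet covered by a higher-ranked tile. This is a minimal illustration rather than the algorithm of Geerts et al.; the tile representation as (rows, cols) index pairs and the pre-enumerated candidate list are assumptions.

```python
import numpy as np

def greedy_k_tiling(D, candidate_tiles, k):
    """Greedily rank up to k tiles, each covering as many yet-uncovered
    1s as possible. `candidate_tiles` is a hypothetical list of
    (rows, cols) index pairs, each defining an exact (all-1s) tile."""
    D = D.astype(bool)
    covered = np.zeros_like(D)   # boolean mask of already-covered cells
    ranking = []
    for _ in range(k):
        best, best_gain = None, 0
        for rows, cols in candidate_tiles:
            idx = np.ix_(rows, cols)
            # gain: 1s in this tile not covered by a higher-ranked tile
            gain = int((D[idx] & ~covered[idx]).sum())
            if gain > best_gain:
                best, best_gain = (rows, cols), gain
        if best is None:         # no tile adds uncovered 1s; stop early
            break
        covered[np.ix_(*best)] = True
        ranking.append(best)
    return ranking
```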
For noisy tiles, overlap alone does not suffice to identify redundancy, as different tiles may explain areas of the data in finer or coarser detail. We should therefore also consider the quality of a tile, for example by penalizing noise, or by favoring tiles that are surprising.
To this end we can re-use the Rasch model [18], now using it to discover surprising tile sets. Similar to ranking tiles based on area, we can rank tilings by computing the likelihood of the entries covered by the tile set [37]. As with ranking individual tiles, we need to normalize this probability, as otherwise the best tiling is automatically a single tile containing the whole dataset. Kontonasios and De Bie [37] do not explicitly update their model, but instead consider the case where the analyst investigates every reported tile exactly. As such, the values (0s and 1s) of a processed tile can be assumed known, and hence the likelihoods of already covered cells are set to 1. They further show this covering problem is an instance of Weighted Budgeted Maximum Set Cover, which is NP-hard, but for which the greedy approach is known to provide good solutions.
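A minimal sketch of this greedy covering scheme follows, assuming the model cell probabilities P (e.g. from a fitted Rasch model) are given, and assuming each tile's cost is simply its area; both are illustrative simplifications, not the exact setup of Kontonasios and De Bie. Covered cells have their likelihood set to 1, so they contribute no further information gain.

```python
import numpy as np

def greedy_surprising_tiles(D, P, candidate_tiles, budget):
    """Greedy Weighted Budgeted Maximum Set Cover over tiles.
    P[i, j] is the model probability that cell (i, j) equals 1.
    Hypothetical cost model: a tile costs its area; the budget caps
    the total cost of selected tiles."""
    D = D.astype(bool)
    # per-cell information content under the model, in bits
    info = np.where(D, -np.log2(P), -np.log2(1.0 - P))
    covered = np.zeros_like(D)
    chosen, spent = [], 0
    while spent < budget:
        best, best_ratio, best_cost = None, 0.0, 0
        for rows, cols in candidate_tiles:
            idx = np.ix_(rows, cols)
            # covered cells are "known" (likelihood 1): zero extra gain
            gain = info[idx][~covered[idx]].sum()
            cost = len(rows) * len(cols)
            if spent + cost <= budget and gain / cost > best_ratio:
                best, best_ratio, best_cost = (rows, cols), gain / cost, cost
        if best is None:
            break
        covered[np.ix_(*best)] = True
        chosen.append(best)
        spent += best_cost
    return chosen
```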
The MaxEnt model composed from tiles can also be used in a similar manner as the MaxEnt model built from itemsets. For instance, given a set of tiles and their densities, that is, the fraction of 1s inside each tile, we can construct the corresponding MaxEnt model and use the likelihood of the data as the goodness of the tile set [67]. Besides ranking a set of candidate tiles, Tatti and Vreeken show that many (exploratory) data mining results on binary data can be translated into sets of noisy tiles. Hence, through the same machinery, we can dynamically rank results from different algorithms based on their relative informativeness [67].
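The sketch below fits such a tile-based MaxEnt model. The factorization into independent Bernoulli cells, whose log-odds sum the multipliers of the tiles covering them, is the standard form for density constraints; the plain gradient-ascent fitting loop and all parameter names are assumptions for illustration (the cited work uses iterative scaling).

```python
import numpy as np

def maxent_from_tiles(shape, tiles, densities, iters=500, lr=0.5):
    """Fit a MaxEnt model over a binary matrix whose constraints are
    tile densities (fraction of 1s per tile). Each cell's probability
    is sigmoid(sum of multipliers of tiles covering it); multipliers
    are fitted by simple gradient ascent (assumed optimizer)."""
    lam = np.zeros(len(tiles))
    masks = []
    for rows, cols in tiles:
        m = np.zeros(shape, dtype=bool)
        m[np.ix_(rows, cols)] = True
        masks.append(m)
    for _ in range(iters):
        theta = sum(l * m for l, m in zip(lam, masks))
        p = 1.0 / (1.0 + np.exp(-theta))
        for t, m in enumerate(masks):
            # gradient: target density minus expected density in tile t
            lam[t] += lr * (densities[t] - p[m].mean())
    return p  # fitted cell probabilities

def log_likelihood(D, p):
    """Goodness of the tile set: log-likelihood of the data D
    under the fitted MaxEnt model."""
    D = D.astype(bool)
    return np.where(D, np.log(p), np.log(1.0 - p)).sum()
```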
Whereas the basic MaxEnt model allows only frequencies of 1s within tiles as background knowledge, a recent paper by Kontonasios and De Bie [38] demonstrates how more complex information can be incorporated. Examples include frequencies of itemsets; unfortunately, however, as we saw for the transaction-based MaxEnt model, this does mean that inference from the model becomes NP-hard in general.
In the introduction we spoke about the impossibility of formalizing the inherently
subjective notion of interestingness in general. Conceptually, however, dynamic
ranking comes very close. As long as we can infer the Maximum Entropy model for
the background knowledge an arbitrary user has, under the assumption that the user
can optimally make all inferences from this knowledge, we know from information
theory that our framework will correctly identify the most surprising result. De Bie
[ 17 ] argues that this setup is one of the most promising for measuring subjective
interestingness. The key challenges he identifies are in defining Maximum Entropy
models for rich data and pattern types, as well as for efficiently mining those patterns
that optimize the score.