Interesting Patterns - Frequent Pattern Mining

Database Reference

In-Depth Information

(partly due to the partitioning constraints necessary to keep computation feasible)

MTV typically finds pattern sets in the order of tens of patterns.

6.2

Tiles

The k maximal tiling problem, as defined by Geerts et al. [ 21 ] is perhaps the earliest

example of pattern set mining. Xiang et al. [ 74 , 75 ] expanded on the problem setting

and aim to cover as many of the 1s of the data with k noisy tiles. Both problem

settings are NP-hard, and are related to Set Cover. As such, the greedy strategy is

known to find a solution within O ( log n ) of the optimum.

Boolean Matrix Factorization Strongly related to tiling is the problem of Boolean

Matrix Factorization (BMF) [ 49 ]. The goal in matrix factorization is to find a low-

rank approximation of the data. In case of BMF, each factor can be regarded as a

noisy tile, and a factorization hence as a set of tiles. BMF is known to be NP-hard, as

well as NP-hard to approximate [ 49 ]. The ASSO algorithm is a heuristic for finding

good k -factorizations, and can be coupled with an MDL strategy to automatically

determine the best model order [ 47 ]. Lucchese et al. [ 40 ] gave the much faster PANDA

algorithm, which optimizes a more loose global objective that weighs the number of

covered 1s with the number of 1s of the factors.

Geometric Tiles So far we have only considered unordered binary data in this

chapter. When we fix the order of the rows and columns, however, for instance

because the data is ordered spatially, it may only make sense to consider tiles that

are consecutive under this ordering. This problem setting gives rise to interesting

problem settings, as well as algorithmic solutions.

Gionis et al. [ 22 ] propose to mine dense geometric tiles that stand out from the

background distribution, and to do so iteratively in order to construct a tree of tiles.

To determine whether a tile is significant, they propose a simple MDL based score.

Finding the optimal tile under this score is O ( n 2 m 2 ) and hence infeasible for non-

trivial data. To circumvent this problem, they propose a randomized approach. Tatti

and Vreeken [ 68 ] recently proposed an improved algorithm that can find the optimal

sub-tile in only O ( mn min(m, n)). The tile trees discovered by these methods typically

contain in the order of 10 s to 100 s of tiles.

Gionis et al. [ 22 ] also showed that by the same strategy, by first applying spectral

ordering, meaningful combinatorial tiles can be discovered from unordered binary

data.

6.3

Swap Randomization

The final pattern set mining approach we consider is based on swap randomization.

Lijffijt et al. [ 39 ] aim to find the smallest set of patterns that explains most about the

Search WWH ::

Custom Search

Home