Database Reference
In-Depth Information
(partly due to the partitioning constraints necessary to keep computation feasible)
MTV typically finds pattern sets in the order of tens of patterns.
6.2
Tiles
The k maximal tiling problem, as defined by Geerts et al. [ 21 ] is perhaps the earliest
example of pattern set mining. Xiang et al. [ 74 , 75 ] expanded on the problem setting
and aim to cover as many of the 1s of the data with k noisy tiles. Both problem
settings are NP-hard, and are related to Set Cover. As such, the greedy strategy is
known to find a solution within O ( log n ) of the optimum.
Boolean Matrix Factorization Strongly related to tiling is the problem of Boolean
Matrix Factorization (BMF) [ 49 ]. The goal in matrix factorization is to find a low-
rank approximation of the data. In case of BMF, each factor can be regarded as a
noisy tile, and a factorization hence as a set of tiles. BMF is known to be NP-hard, as
well as NP-hard to approximate [ 49 ]. The ASSO algorithm is a heuristic for finding
good k -factorizations, and can be coupled with an MDL strategy to automatically
determine the best model order [ 47 ]. Lucchese et al. [ 40 ] gave the much faster PANDA
algorithm, which optimizes a more loose global objective that weighs the number of
covered 1s with the number of 1s of the factors.
Geometric Tiles So far we have only considered unordered binary data in this
chapter. When we fix the order of the rows and columns, however, for instance
because the data is ordered spatially, it may only make sense to consider tiles that
are consecutive under this ordering. This problem setting gives rise to interesting
problem settings, as well as algorithmic solutions.
Gionis et al. [ 22 ] propose to mine dense geometric tiles that stand out from the
background distribution, and to do so iteratively in order to construct a tree of tiles.
To determine whether a tile is significant, they propose a simple MDL based score.
Finding the optimal tile under this score is O ( n 2 m 2 ) and hence infeasible for non-
trivial data. To circumvent this problem, they propose a randomized approach. Tatti
and Vreeken [ 68 ] recently proposed an improved algorithm that can find the optimal
sub-tile in only O ( mn min(m, n)). The tile trees discovered by these methods typically
contain in the order of 10 s to 100 s of tiles.
Gionis et al. [ 22 ] also showed that by the same strategy, by first applying spectral
ordering, meaningful combinatorial tiles can be discovered from unordered binary
data.
6.3
Swap Randomization
The final pattern set mining approach we consider is based on swap randomization.
Lijffijt et al. [ 39 ] aim to find the smallest set of patterns that explains most about the
Search WWH ::




Custom Search