Database Reference
In-Depth Information
one 1 to 0 makes it that the complete tile will not be discovered; instead we will find
two partitions. Every further flipped value will partition the tile more.
More in particular, it may not be realistic to expect a process to generate a tile full
of ones. Instead, we may need to relax our requirement and look for noisy ,or dense
tiles instead of exact tiles.
Noisy Tile Mining A noisy tile is a tile T associated with a frequency of ones in
the data, for which we write, slightly abusing notation,
= |{
( i , j )
( X
×
Y )
|
D ij
=
1
}|
fr ( T )
.
|
X
||
Y
|
An exact tile then is a special case, with fr ( T )
1 . 0. When mining noisy tiles we
are interested in finding large areas in the data that contain many 1s, or possibly,
many 0s.
=
Intuition The more uniform the values of D over the area identified by T , i.e., the
more 1s resp. 0s we find, the more interesting the tile.
We find the problem of mining noisy tiles in many guises and embedded in many
problem settings. Examples include dense itemset mining [ 60 ], dense tile mining
[ 75 ], bi-clustering [ 55 ], and Boolean Matrix Factorization [ 49 , 40 ], as well as fault-
tolerant itemset mining [ 54 ].
Fault tolerant itemset mining for a large part follows the regular frequent itemset
mining setting, with, however, the twist that we do not just calculate support over
t
t , but also those transactions that nearly but not exactly
support t . The general approach is that, per itemset X , we are given a budget of
1s that we may use to maximize the fault-tolerant support of X [ 54 ]. Clearly, a fixed
budget favors small itemsets, as there per row fewer items can be missing. Poernomo
and Gopalkrishnan [ 56 ] gave an efficient algorithm for mining fault-tolerant itemsets
where the budget is dependent on the cardinality of the itemset.
Seppänen and Mannila [ 60 ] generalized the problem of fault-tolerant itemset
mining to dense itemset mining. That is, instead of using a fixed budget of flips, the
proposed algorithms mine itemsets for which we there exist at least σ rows such that
the density of 1s in the data projected over the itemset is at least δ .
In Boolean Matrix Factorization the goal is to find a low-rank approximation of the
full data matrix. Optimizing, as well as approximating this problem is NP-hard [ 49 ],
and hence the standard approach is to iteratively find good rank-1 approximations of
the data, i.e., large noisy tiles with high frequency. The Asso algorithm does this by
searching for tiles that exhibit high association between the rows and columns, and
has been shown to efficient heuristic for finding large noisy tiles [ 49 ].
Frequent itemsets can be used to bootstrap the search for dense areas in the data.
Xiang et al. [ 75 ] gave a fast heuristic for finding dense tiles that first mines closed
itemsets, and then iteratively combines them until a density threshold is reached. We
find a similar strategy in the PandA algorithm [ 40 ].
D for which X
Search WWH ::




Custom Search