Database Reference
In-Depth Information
one will have to be careful what distribution to sample from. Cohen et al. sample
according to a hash-based score that estimates the correlation between two items. In
theory this approach can be extended to itemsets of arbitrary sizes, but will require
a non-trivial extension of this estimate.
A possible solution to this end may have been given by Boley et al. [ 7 ], whom
proposed a framework that allows to directly sample itemsets proportional to any
score based on frequency and/or cardinality of a pattern. However, the more different
'components' a score has, the more computationally expensive the pre-processing
becomes. In a follow-up paper [ 8 ], the same authors refined the procedure and
removed the need for this pre-processing by formalizing a coupling-from-the-past
MCMC sampler.
Al Hassan and Zaki [ 4 ] proposed a different approach that allows for directly
sampling the output space of any pattern miner. While the paper discusses patterns in
graphs, the same techniques can be applied for mining itemsets. In follow-up work
they discuss Origami [ 5 ], an approach to sample patterns that are representative,
as well as orthogonal to earlier sampled patterns. By sampling patterns not just
proportionally to a static distribution, but with regard to earlier sampled results, this
process comes rather close to dynamic ranking, which we will discuss in more detail
in Sect. 5 .
2.2
Tiles
The next class of absolute interestingness measures we consider are not for itemsets,
but for tiles . In tile mining we are particularly interested in the area a pattern covers
in the data. That is,
L
consists of tiles T
=
( X , Y ) which are defined by both an
intention , a subset of all items X
I
, as well as an extension , a subset of all rows
Y
R
. We then use q to calculate the interestingness over the cells of D identified
by X
×
Y .
Large Tile Mining The most constrained variant of this task is to mine exact tiles,
tiles for which in D we find only 1s, that meet a minarea threshold. That is, tile
for which area ( T )
minarea . A maximal tile is then a tile T for which
we cannot add an element to X or Y , and updating the vice-versa to maintain the
all-1s constraint, without the area ( T ) decreasing. Note that as area does not exhibit
monotonicity, the level-wise algorithm cannot be applied.
=|
X
||
Y
|≥
Intuition Large areas of only 1s in D are interesting.
Geerts et al. [ 21 ], however, gave a set of constraints that can be used to mine large
tiles efficiently in practice; essentially implementing the greedy algorithm for Set
Cover [ 21 ]. It is interesting to note that every large tile is a closed frequent itemset,
an observation Xiang et al. [ 74 ] used in their algorithm, first mining closed frequent
itemsets and then pruning this set.
Noise and large tile mining do not go together well. That is, given a dataset in
which there exists one large tile against an empty background, simply by flipping
Search WWH ::




Custom Search