Interesting Patterns - Frequent Pattern Mining

Database Reference

In-Depth Information

one will have to be careful what distribution to sample from. Cohen et al. sample

according to a hash-based score that estimates the correlation between two items. In

theory this approach can be extended to itemsets of arbitrary sizes, but will require

a non-trivial extension of this estimate.

A possible solution to this end may have been given by Boley et al. [ 7 ], whom

proposed a framework that allows to directly sample itemsets proportional to any

score based on frequency and/or cardinality of a pattern. However, the more different

'components' a score has, the more computationally expensive the pre-processing

becomes. In a follow-up paper [ 8 ], the same authors refined the procedure and

removed the need for this pre-processing by formalizing a coupling-from-the-past

MCMC sampler.

Al Hassan and Zaki [ 4 ] proposed a different approach that allows for directly

sampling the output space of any pattern miner. While the paper discusses patterns in

graphs, the same techniques can be applied for mining itemsets. In follow-up work

they discuss Origami [ 5 ], an approach to sample patterns that are representative,

as well as orthogonal to earlier sampled patterns. By sampling patterns not just

proportionally to a static distribution, but with regard to earlier sampled results, this

process comes rather close to dynamic ranking, which we will discuss in more detail

in Sect. 5 .

2.2

Tiles

The next class of absolute interestingness measures we consider are not for itemsets,

but for tiles . In tile mining we are particularly interested in the area a pattern covers

in the data. That is,

L

consists of tiles T

=

( X , Y ) which are defined by both an

intention , a subset of all items X

⊆ I

, as well as an extension , a subset of all rows

Y

⊆ R

. We then use q to calculate the interestingness over the cells of D identified

by X

×

Y .

Large Tile Mining The most constrained variant of this task is to mine exact tiles,

tiles for which in D we find only 1s, that meet a minarea threshold. That is, tile

for which area ( T )

minarea . A maximal tile is then a tile T for which

we cannot add an element to X or Y , and updating the vice-versa to maintain the

all-1s constraint, without the area ( T ) decreasing. Note that as area does not exhibit

monotonicity, the level-wise algorithm cannot be applied.

=|

X

||

Y

|≥

Intuition Large areas of only 1s in D are interesting.

Geerts et al. [ 21 ], however, gave a set of constraints that can be used to mine large

tiles efficiently in practice; essentially implementing the greedy algorithm for Set

Cover [ 21 ]. It is interesting to note that every large tile is a closed frequent itemset,

an observation Xiang et al. [ 74 ] used in their algorithm, first mining closed frequent

itemsets and then pruning this set.

Noise and large tile mining do not go together well. That is, given a dataset in

which there exists one large tile against an empty background, simply by flipping

Frequent Pattern Mining

Search WWH ::

Custom Search

Home