Interesting Patterns - Frequent Pattern Mining

Database Reference

In-Depth Information

one 1 to 0 makes it that the complete tile will not be discovered; instead we will find

two partitions. Every further flipped value will partition the tile more.

More in particular, it may not be realistic to expect a process to generate a tile full

of ones. Instead, we may need to relax our requirement and look for noisy ,or dense

tiles instead of exact tiles.

Noisy Tile Mining A noisy tile is a tile T associated with a frequency of ones in

the data, for which we write, slightly abusing notation,

= |{

( i , j )

∈

( X

×

Y )

|

D ij

=

1

}|

fr ( T )

.

|

X

||

Y

|

An exact tile then is a special case, with fr ( T )

1 . 0. When mining noisy tiles we

are interested in finding large areas in the data that contain many 1s, or possibly,

many 0s.

=

Intuition The more uniform the values of D over the area identified by T , i.e., the

more 1s resp. 0s we find, the more interesting the tile.

We find the problem of mining noisy tiles in many guises and embedded in many

problem settings. Examples include dense itemset mining [ 60 ], dense tile mining

[ 75 ], bi-clustering [ 55 ], and Boolean Matrix Factorization [ 49 , 40 ], as well as fault-

tolerant itemset mining [ 54 ].

Fault tolerant itemset mining for a large part follows the regular frequent itemset

mining setting, with, however, the twist that we do not just calculate support over

t

t , but also those transactions that nearly but not exactly

support t . The general approach is that, per itemset X , we are given a budget of

1s that we may use to maximize the fault-tolerant support of X [ 54 ]. Clearly, a fixed

budget favors small itemsets, as there per row fewer items can be missing. Poernomo

and Gopalkrishnan [ 56 ] gave an efficient algorithm for mining fault-tolerant itemsets

where the budget is dependent on the cardinality of the itemset.

Seppänen and Mannila [ 60 ] generalized the problem of fault-tolerant itemset

mining to dense itemset mining. That is, instead of using a fixed budget of flips, the

proposed algorithms mine itemsets for which we there exist at least σ rows such that

the density of 1s in the data projected over the itemset is at least δ .

In Boolean Matrix Factorization the goal is to find a low-rank approximation of the

full data matrix. Optimizing, as well as approximating this problem is NP-hard [ 49 ],

and hence the standard approach is to iteratively find good rank-1 approximations of

the data, i.e., large noisy tiles with high frequency. The Asso algorithm does this by

searching for tiles that exhibit high association between the rows and columns, and

has been shown to efficient heuristic for finding large noisy tiles [ 49 ].

Frequent itemsets can be used to bootstrap the search for dense areas in the data.

Xiang et al. [ 75 ] gave a fast heuristic for finding dense tiles that first mines closed

itemsets, and then iteratively combines them until a density threshold is reached. We

find a similar strategy in the PandA algorithm [ 40 ].

∈

D for which X

⊆

Frequent Pattern Mining

Search WWH ::

Custom Search

Home