itemset are also free. In our example, the frequent free itemsets are F_free = {a, b, c, d}. Itemset cd is also free, but it is not frequent.
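To make the freeness check concrete, here is a minimal Python sketch; it is an illustration under assumptions, not code from the literature, and the toy database D below is made up rather than the chapter's running example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

def is_free(itemset, transactions):
    """An itemset is free iff no proper subset has the same support.
    By anti-monotonicity of support it suffices to check the immediate subsets."""
    supp_x = support(itemset, transactions)
    return all(support(sub, transactions) != supp_x
               for sub in combinations(itemset, len(itemset) - 1))

# toy database (an assumption for illustration only)
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"d"}]
print(is_free({"a", "b"}, D))  # True: both a and b occur more often than ab
```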
Non-Derivable Itemsets Calders and Goethals [12] developed the notion of support derivability a step further, and proposed to mine non-derivable frequent itemsets. An itemset is said to be derivable if we can derive its support using inclusion/exclusion rules, i.e., the upper and lower bound of its support are equal. In our example, abc is a derivable itemset: since supp(bc) = supp(c) = 2 we know that whenever c appears, b appears as well. Hence, it follows that supp(abc) = supp(ac) = 2. In our example, the set of non-derivable itemsets for minsup = 2 is F_ndi = {a, b, c, ab, ac, bc}.
Like free itemsets, non-derivable itemsets are also monotonically downward closed,
which allows us to mine them efficiently.
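As a rough illustration of the inclusion/exclusion bounds, the Python sketch below computes a lower and upper bound on supp(X) from the supports of its proper subsets; an itemset is derivable exactly when the two bounds coincide. This is a sketch of the bound computation only, not Calders and Goethals' mining algorithm, and the toy database D is again an assumption, not the chapter's running example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

def ndi_bounds(X, transactions):
    """Inclusion/exclusion bounds on supp(X), derived from the supports of
    its proper subsets.  X is derivable iff lower == upper."""
    X = set(X)
    lower, upper = 0, float("inf")
    for k in range(len(X)):                       # every proper subset Y of X
        for Y in map(set, combinations(sorted(X), k)):
            rest = sorted(X - Y)
            total = 0
            for r in range(len(rest)):            # every Z with Y <= Z < X
                for extra in combinations(rest, r):
                    Z = Y | set(extra)
                    total += (-1) ** (len(X - Z) + 1) * support(Z, transactions)
            if len(X - Y) % 2 == 1:               # odd |X \ Y|: upper bound
                upper = min(upper, total)
            else:                                 # even |X \ Y|: lower bound
                lower = max(lower, total)
    return lower, upper

# toy database (an assumption for illustration only)
D = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
lo, hi = ndi_bounds({"a", "b", "c"}, D)
print(lo, hi, support({"a", "b", "c"}, D))  # 2 2 2: here abc is derivable
```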
In practice, for a given database and threshold the number of closed itemsets and non-derivable itemsets is typically comparable; exactly how many depends on the structure of the data. In both cases, for clean data, up to orders of magnitude fewer itemsets are returned than when mining all frequent itemsets. However, if the data is noisy, it can happen that no reduction is obtained and we still find millions or billions of itemsets for non-trivial thresholds.
Margin-Closed and Robust Frequent Itemsets Moerchen et al. [50] hence argue to prune more aggressively, and to this end propose to relax the requirement of maintaining frequencies exactly. That is, to mine margin-closed frequent itemsets; essentially reporting only those frequent itemsets whose support deviates by more than a certain amount from that of their subsets. A related, but different approach was recently proposed by Tatti and Moerchen [66], who acknowledge that the data at hand is just a sample; whether a given itemset is frequent, maximal, closed, or non-derivable may just be happenstance. To this end they propose to mine only those itemsets that exhibit a given property robustly, i.e., in many random subsamples of the data. For example, the more (sub)samples of the data in which we find a certain itemset to be closed, the more informative it is to report this particular itemset to the end-user. A happy coincidence of robustness is that the monotonicity of the chosen property propagates. That is, if the property is monotonic, for example non-derivability or freeness, the robust version is also monotonic, and hence for those measures we can mine robust itemsets efficiently.
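To illustrate the robustness idea, the sketch below estimates how often an itemset is closed in random subsamples in which each transaction is kept with some probability. This is a simplistic Monte Carlo illustration under assumptions, not Tatti and Moerchen's estimator; the keep-probability, sample count, and toy database are all made up for the example.

```python
import random

def closure(X, transactions):
    """Intersection of all transactions containing X; X is closed iff
    closure(X) == X (and X occurs at all)."""
    covers = [set(t) for t in transactions if X <= t]
    if not covers:
        return None
    out = covers[0]
    for t in covers[1:]:
        out &= t
    return out

def robustness(X, transactions, n_samples=1000, keep=0.5, seed=0):
    """Estimated fraction of random subsamples (each transaction kept
    independently with probability `keep`) in which X is closed."""
    rng = random.Random(seed)
    X = set(X)
    hits = 0
    for _ in range(n_samples):
        sample = [t for t in transactions if rng.random() < keep]
        hits += closure(X, sample) == X
    return hits / n_samples

# toy database (an assumption for illustration only)
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}]
print(robustness({"a", "b"}, D))  # fraction of subsamples in which ab is closed
```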
Sampling Frequent Itemsets We should stress that A Priori works for any monotonic measure, for example, the Jaccard-distance based measure Cohen et al. [14] propose,
supp(X) / |{ t ∈ D | X ∩ t ≠ ∅ }|,
the support of the itemset divided by the number of transactions that share at least
one element with X. However, while monotonic, in practice A Priori is impractical for this measure: we cannot prune any singletons (for a single item the measure always equals 1), and hence have F_1 = I, by which
already at the first step we have to check all itemsets of size 2. To circumvent this
problem, Cohen et al. first of all consider only itemsets of length 2, and only calculate the actual score for a sample of the complete candidate set. However, because of the
exponential search space, to avoid mining mostly itemsets with very low scores,
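For concreteness, here is a minimal sketch of the measure defined above (my own illustration assuming set-valued transactions, not Cohen et al.'s code):

```python
def cohen_measure(X, transactions):
    """supp(X) divided by the number of transactions sharing at least one
    item with X (the Jaccard-style measure described above)."""
    X = set(X)
    supp = sum(1 for t in transactions if X <= t)
    overlap = sum(1 for t in transactions if X & t)
    return supp / overlap if overlap else 0.0

# toy database (an assumption for illustration only)
D = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"c", "d"}]
print(cohen_measure({"a", "b"}, D))  # 1 / 3
```

Note that for a singleton X = {i} both numerator and denominator equal supp(i), so the measure is 1 for every single item, which is exactly why no singletons can be pruned.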