itemset are also free. In our example, the frequent free itemsets are F_free = {a, b, c, d}. Itemset cd is also free, but it is not frequent.
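To make the freeness check concrete, here is a minimal Python sketch; it is an illustration under assumptions, not code from the literature, and the toy database D below is made up rather than the chapter's running example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

def is_free(itemset, transactions):
    """An itemset is free iff no proper subset has the same support.
    By anti-monotonicity of support it suffices to check the immediate subsets."""
    supp_x = support(itemset, transactions)
    return all(support(sub, transactions) != supp_x
               for sub in combinations(itemset, len(itemset) - 1))

# toy database (an assumption for illustration only)
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"d"}]
print(is_free({"a", "b"}, D))  # True: both a and b occur more often than ab
```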
Non-Derivable Itemsets Calders and Goethals [12] developed the notion of support derivability a step further, and proposed to mine non-derivable frequent itemsets. An itemset is said to be derivable if we can derive its support using inclusion/exclusion rules, i.e., the upper and lower bound of its support are equal. In our example, abc is a derivable itemset: since supp(bc) = supp(c) = 2 we know that whenever c appears, b appears as well. Hence, it follows that supp(abc) = supp(ac) = 2. In our example, the set of non-derivable itemsets for minsup = 2 is F_ndi = {a, b, c, ab, ac, bc}.
Like free itemsets, non-derivable itemsets are also monotonically downward closed,
which allows us to mine them efficiently.
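As a rough illustration of the inclusion/exclusion bounds, the Python sketch below computes a lower and upper bound on supp(X) from the supports of its proper subsets; an itemset is derivable exactly when the two bounds coincide. This is a sketch of the bound computation only, not Calders and Goethals' mining algorithm, and the toy database D is again an assumption, not the chapter's running example.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= t)

def ndi_bounds(X, transactions):
    """Inclusion/exclusion bounds on supp(X), derived from the supports of
    its proper subsets.  X is derivable iff lower == upper."""
    X = set(X)
    lower, upper = 0, float("inf")
    for k in range(len(X)):                       # every proper subset Y of X
        for Y in map(set, combinations(sorted(X), k)):
            rest = sorted(X - Y)
            total = 0
            for r in range(len(rest)):            # every Z with Y <= Z < X
                for extra in combinations(rest, r):
                    Z = Y | set(extra)
                    total += (-1) ** (len(X - Z) + 1) * support(Z, transactions)
            if len(X - Y) % 2 == 1:               # odd |X \ Y|: upper bound
                upper = min(upper, total)
            else:                                 # even |X \ Y|: lower bound
                lower = max(lower, total)
    return lower, upper

# toy database (an assumption for illustration only)
D = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
lo, hi = ndi_bounds({"a", "b", "c"}, D)
print(lo, hi, support({"a", "b", "c"}, D))  # 2 2 2: here abc is derivable
```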
In practice, for a given database and threshold the number of closed itemsets and non-derivable itemsets is typically comparable; exactly how many depends on the structure of the data. In both cases, for clean data, up to orders of magnitude fewer itemsets are returned than when mining all frequent itemsets. However, if the data is noisy, it can happen that no reduction is obtained and we still find millions or billions of itemsets for non-trivial thresholds.
Margin-Closed and Robust Frequent Itemsets Moerchen et al. [50] hence argue to prune more aggressively, and to this end propose to relax the requirement of maintaining frequencies exactly. That is, to mine margin-closed frequent itemsets; essentially reporting only those frequent itemsets whose support deviates by more than a certain amount from that of their subsets. A related, but different approach was recently proposed by Tatti and Moerchen [66], who acknowledge that the data at hand is just a sample; whether a given itemset is frequent, maximal, closed, or non-derivable may just be happenstance. To this end they propose to mine only those itemsets that exhibit a given property robustly, i.e., in many random subsamples of the data. For example, the more (sub)samples of the data in which we find a certain itemset to be closed, the more informative it is to report this particular itemset to the end-user. A happy coincidence of robustness is that the monotonicity of the chosen property propagates. That is, if the property is monotonic, for example non-derivability or freeness, the robust version is also monotonic, and hence for those measures we can mine robust itemsets efficiently.
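To illustrate the robustness idea, the sketch below estimates how often an itemset is closed in random subsamples in which each transaction is kept with some probability. This is a simplistic Monte Carlo illustration under assumptions, not Tatti and Moerchen's estimator; the keep-probability, sample count, and toy database are all made up for the example.

```python
import random

def closure(X, transactions):
    """Intersection of all transactions containing X; X is closed iff
    closure(X) == X (and X occurs at all)."""
    covers = [set(t) for t in transactions if X <= t]
    if not covers:
        return None
    out = covers[0]
    for t in covers[1:]:
        out &= t
    return out

def robustness(X, transactions, n_samples=1000, keep=0.5, seed=0):
    """Estimated fraction of random subsamples (each transaction kept
    independently with probability `keep`) in which X is closed."""
    rng = random.Random(seed)
    X = set(X)
    hits = 0
    for _ in range(n_samples):
        sample = [t for t in transactions if rng.random() < keep]
        hits += closure(X, sample) == X
    return hits / n_samples

# toy database (an assumption for illustration only)
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}]
print(robustness({"a", "b"}, D))  # fraction of subsamples in which ab is closed
```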
Sampling Frequent Itemsets We should stress that A Priori works for any monotonic measure, for example, the Jaccard-distance based measure Cohen et al. [14] propose,
supp(X) / |{ t ∈ D | X ∩ t ≠ ∅ }|,
the support of the itemset divided by the number of transactions that share at least
one element with X. However, while monotonic, in practice A Priori is impractical for this measure: we cannot prune any singletons (for a single item the measure always equals 1), and hence have F_1 = I, by which
already at the first step we have to check all itemsets of size 2. To circumvent this
problem, Cohen et al. first of all consider only itemsets of length 2, and only calculate the actual score for a sample of the complete candidate set. However, because of the
exponential search space, to avoid mining mostly itemsets with very low scores,
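For concreteness, here is a minimal sketch of the measure defined above (my own illustration assuming set-valued transactions, not Cohen et al.'s code):

```python
def cohen_measure(X, transactions):
    """supp(X) divided by the number of transactions sharing at least one
    item with X (the Jaccard-style measure described above)."""
    X = set(X)
    supp = sum(1 for t in transactions if X <= t)
    overlap = sum(1 for t in transactions if X & t)
    return supp / overlap if overlap else 0.0

# toy database (an assumption for illustration only)
D = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"c", "d"}]
print(cohen_measure({"a", "b"}, D))  # 1 / 3
```

Note that for a singleton X = {i} both numerator and denominator equal supp(i), so the measure is 1 for every single item, which is exactly why no singletons can be pruned.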