Database Reference
In-Depth Information
support I
2
. An itemset I is called large or frequent in database D, if and only if,
its frequency in D is at least equal to a minimum frequency threshold mfreq, set
by the owner of the data. Equivalently, I is large in D, if and only, sup(I; D)
msup, where msup = mfreqN. The set of frequent itemsets for a database D is
denoted as F
D
=fI I : freq(I; D). All the itemsets having a frequency lower than
mfreq (equivalently a support lower than msup) are called infrequent and their set is
Ã(I)F
D
.
The second step in the process of association rule mining typically involves the
identification of all the significant association rules that hold among the derived
frequent itemsets. An association rule I ) J is significant if it holds in database D
with a confidence that is higher than a minimum confidence threshold mconf set by
the owner of the data, i.e. when
sup
(I[J;D)
sup
(I;D)
mconf . The support of this rule in D
is equal to that of its generating itemset, i.e. is equal to sup(I[J; D). An example of
association rule mining (borrowed from [72]) is shown on
Fig. 2.1.
Rules
Confidence Support
Itemset Support
a
b ) a
100%
4
tid Itemset
T
1
6
b ) c
75%
3
abc
b
4
c ) a
100%
4
T
2
abc
c
4
c ) b
75%
3
T
3
abc
ab
4
b ) ac
75%
3
T
4
ab
ac
4
c ) ab
75%
3
T
5
a
bc
3
ab ) c
75%
3
T
6
ac
abc
3
ac ) b
75%
3
bc ) a
100%
3
Fig. 2.1: Database D along with its itemsets and related association rules.
Border Theory
The theory of the borders of the frequent itemsets is also very important in our
discussion as both border-based and exact hiding methodologies rely on a pro-
cess inspired from this theory. Let F
D
be the set of all frequent itemsets in D, and
P =Ã(I) be the set of all patterns in the lattice of D (e.g., see
Fig. 2.2).
The pos-
itive border of F
D
, denoted as Bd
+
(F
D
), consists of all the maximally frequent
patterns in P, i.e. all the patterns in F
D
, whose all proper supersets are infrequent.
Formally, Bd
+
(F
D
) = fI 2F
D
j for all J 2P with I J we have that J 2F
D
g.
Respectively, the negative border of F
D
, denoted as Bd
(F
D
), consists of all the
minimally infrequent patterns in P, i.e. all the patterns in PnF
D
, whose all proper
subsets are frequent. Formally, Bd
(F
D
) =fI 2PnF
D
j for all J I we have that
2
We will use sup(I) and freq(I) instead of sup(I; D) and freq(I; D), respectively, for notational
convenience, when database D is obvious in our context.