Database Reference
In-Depth Information
support I 2 . An itemset I is called large or frequent in database D, if and only if,
its frequency in D is at least equal to a minimum frequency threshold mfreq, set
by the owner of the data. Equivalently, I is large in D, if and only, sup(I; D)
msup, where msup = mfreqN. The set of frequent itemsets for a database D is
denoted as F D =fI I : freq(I; D). All the itemsets having a frequency lower than
mfreq (equivalently a support lower than msup) are called infrequent and their set is
Ã(I)F D .
The second step in the process of association rule mining typically involves the
identification of all the significant association rules that hold among the derived
frequent itemsets. An association rule I ) J is significant if it holds in database D
with a confidence that is higher than a minimum confidence threshold mconf set by
the owner of the data, i.e. when sup (I[J;D)
sup (I;D) mconf . The support of this rule in D
is equal to that of its generating itemset, i.e. is equal to sup(I[J; D). An example of
association rule mining (borrowed from [72]) is shown on Fig. 2.1.
Rules
Confidence Support
Itemset Support
a
b ) a
100%
4
tid Itemset
T 1
6
b ) c
75%
3
abc
b
4
c ) a
100%
4
T 2
abc
c
4
c ) b
75%
3
T 3
abc
ab
4
b ) ac
75%
3
T 4
ab
ac
4
c ) ab
75%
3
T 5
a
bc
3
ab ) c
75%
3
T 6
ac
abc
3
ac ) b
75%
3
bc ) a
100%
3
Fig. 2.1: Database D along with its itemsets and related association rules.
Border Theory
The theory of the borders of the frequent itemsets is also very important in our
discussion as both border-based and exact hiding methodologies rely on a pro-
cess inspired from this theory. Let F D be the set of all frequent itemsets in D, and
P =Ã(I) be the set of all patterns in the lattice of D (e.g., see Fig. 2.2). The pos-
itive border of F D , denoted as Bd + (F D ), consists of all the maximally frequent
patterns in P, i.e. all the patterns in F D , whose all proper supersets are infrequent.
Formally, Bd + (F D ) = fI 2F D j for all J 2P with I J we have that J 2F D g.
Respectively, the negative border of F D , denoted as Bd (F D ), consists of all the
minimally infrequent patterns in P, i.e. all the patterns in PnF D , whose all proper
subsets are frequent. Formally, Bd (F D ) =fI 2PnF D j for all J I we have that
2 We will use sup(I) and freq(I) instead of sup(I; D) and freq(I; D), respectively, for notational
convenience, when database D is obvious in our context.
 
 
Search WWH ::




Custom Search