Database Reference
In-Depth Information
C 4
Itemset I was infrequent in D O and became frequent in D.
The original border (Figure 9.1(i)) corresponds to the hyperplane that partitions the
universe of itemsets into two groups: the frequent itemsets of F D O (depicted on the
left of the original borderline) and their infrequent counterparts in PF D O (shown
on the right of the borderline). The optimal hiding process is then defined as the
revision of the original border so that the revised border excludes from the frequent
itemsets the sensitive ones and their supersets, i.e. all those itemsets that appear in
S max =fI 2F D O j9J 2S min ; J Ig, where S min =fI 2Sj for all J I;
J 2Sg
contains all the minimal sensitive itemsets from S.
Given the four possible scenarios for an itemset I prior and after the application
of border revision, we deduce that in an optimal hiding scenario, where no side-
effects are introduced by the hiding algorithm, C 2 should always hold, while C 4
must never hold. On the contrary, C 1 must hold for all the itemsets in PS max ,
while C 3 must hold for all the itemsets in S max . Thus, the hiding of the sensitive
itemsets can be pictured as a movement of the original borderline in the lattice to a
new position that adheres to the optimal set F 0 D =F D O S max .
Continuing our example, given that S =fe; ae; bcg (shown in squares in Figure
9.1(i)), we have that S min = fe; bcg (shown in bold i n Figure 9.1(i)) , and S max =
fe; ae; be; bc; ce; abc; aceg. Based on D O in Figure 9.1(i) we present the itemsets
of the original positive border (double underlined) and the original negative border
(single underlined). Ideally, the frequent itemsets in the sanitized database D will be
exactly those in F 0 D =fa; b; c; d; ab; ac; ad; cd; acdg. The revised borderline along
with the corresponding revised borders for D and scenarios C 1 and C 3 that pertain
to an exact hiding solution, are depicted in Figure 9.1(ii). Specifically, in this figure,
the itemsets of the revised positive border are double underlined, while those of the
revised negative border are single underlined. To enhance the clarity of the figure
only a small portion of itemsets, among those involved in C 2 , are shown. What is
then needed, is a way to modify the transactions of the original databaseD O in order
to have them support the revised border in database D.
To capture the optimal borderline, which is the revised border for the original
database D O based on the problem at hand, one can use the border theory and iden-
tify the key itemsets which separate all the frequent patterns from their infrequent
counterparts. As it is proven in [23, 24] these key itemsets correspond to the union
of the revised positive border Bd + (F 0 D ) and the revised negative border Bd (F 0 D ).
In what follows, we provide some efficient algorithms for computing the positive
and the negative borders, both for capturing the original and the revised borderline.
Algorithm 9.1 provides a straight-forward way to compute the negative border. It
achieves this by incorporating the computation to the Apriori algorithm of [7]. The
new algorithm has extra code in the NB-Apriori and the GetLargeItems procedures
to compute the border. On the other hand, the Apriori-Gen procedure is the same
as in the original version of Apriori. The proposed method for the computation of
the negative border relies on Apriori's candidate generation scheme, which uses
(k1) frequent itemsets to produce candidate k large itemsets. Thus, it achieves to
identify the first infrequent candidates in the lattice, whose all subsets are found to
Search WWH ::




Custom Search