Border Revision for Knowledge Hiding - Association Rule Hiding for Data Mining

Database Reference

In-Depth Information

C 4

Itemset I was infrequent in D O and became frequent in D.

The original border (Figure 9.1(i)) corresponds to the hyperplane that partitions the

universe of itemsets into two groups: the frequent itemsets of F D O (depicted on the

left of the original borderline) and their infrequent counterparts in PF D O (shown

on the right of the borderline). The optimal hiding process is then defined as the

revision of the original border so that the revised border excludes from the frequent

itemsets the sensitive ones and their supersets, i.e. all those itemsets that appear in

S max =fI 2F D O j9J 2S min ; J Ig, where S min =fI 2Sj for all J I;

J 2Sg

contains all the minimal sensitive itemsets from S.

Given the four possible scenarios for an itemset I prior and after the application

of border revision, we deduce that in an optimal hiding scenario, where no side-

effects are introduced by the hiding algorithm, C 2 should always hold, while C 4

must never hold. On the contrary, C 1 must hold for all the itemsets in PS max ,

while C 3 must hold for all the itemsets in S max . Thus, the hiding of the sensitive

itemsets can be pictured as a movement of the original borderline in the lattice to a

new position that adheres to the optimal set F 0 D =F D O S max .

Continuing our example, given that S =fe; ae; bcg (shown in squares in Figure

9.1(i)), we have that S min = fe; bcg (shown in bold i n Figure 9.1(i)) , and S max =

fe; ae; be; bc; ce; abc; aceg. Based on D O in Figure 9.1(i) we present the itemsets

of the original positive border (double underlined) and the original negative border

(single underlined). Ideally, the frequent itemsets in the sanitized database D will be

exactly those in F 0 D =fa; b; c; d; ab; ac; ad; cd; acdg. The revised borderline along

with the corresponding revised borders for D and scenarios C 1 and C 3 that pertain

to an exact hiding solution, are depicted in Figure 9.1(ii). Specifically, in this figure,

the itemsets of the revised positive border are double underlined, while those of the

revised negative border are single underlined. To enhance the clarity of the figure

only a small portion of itemsets, among those involved in C 2 , are shown. What is

then needed, is a way to modify the transactions of the original databaseD O in order

to have them support the revised border in database D.

To capture the optimal borderline, which is the revised border for the original

database D O based on the problem at hand, one can use the border theory and iden-

tify the key itemsets which separate all the frequent patterns from their infrequent

counterparts. As it is proven in [23, 24] these key itemsets correspond to the union

of the revised positive border Bd + (F 0 D ) and the revised negative border Bd (F 0 D ).

In what follows, we provide some efficient algorithms for computing the positive

and the negative borders, both for capturing the original and the revised borderline.

Algorithm 9.1 provides a straight-forward way to compute the negative border. It

achieves this by incorporating the computation to the Apriori algorithm of [7]. The

new algorithm has extra code in the NB-Apriori and the GetLargeItems procedures

to compute the border. On the other hand, the Apriori-Gen procedure is the same

as in the original version of Apriori. The proposed method for the computation of

the negative border relies on Apriori's candidate generation scheme, which uses

(k1) frequent itemsets to produce candidate k large itemsets. Thus, it achieves to

identify the first infrequent candidates in the lattice, whose all subsets are found to

Association Rule Hiding for Data Mining

Search WWH ::

Custom Search

Home