Inline Algorithm - Association Rule Hiding for Data Mining

Database Reference

In-Depth Information

14.1 Privacy Model

In Chapter 9, we highlighted the process of border revision and we presented a set

of algorithms which enable the computation of the revised borders that pertain to

an exact hiding solution. The inline methodology that is covered in this chapter

aims to enforce the revised borders when constructing the sanitized version D of

the original database D O . The sanitization approach that is followed is based on the

modification of selected T nm values from one to zero (1!0), thus excluding certain

items from selected transactions of D O . Since no transactions are added of removed

to/from the original database, the sanitized version D of database D O will have the

same number of transactions as the original one. Therefore, it holds thatjDj=jD O j.

Using F D to denote the set of frequent itemsets in the sanitized database D, the goal

that must be accomplished for the hiding of all the sensitive itemsets in D can be

expressed as follows: 8S 2S min : S 2F D , where S min is the set of minimal sensitive

itemsets, formally introduced in Chapter 9.

It is trivial to prove that the aforementioned goal can be easily achieved for any

databaseD O , provided that sufficient distortion is introduced to the database 1 . How-

ever, in reality one would like to hide all sensitive itemsets while minimally affecting

the original database. As stated in Chapter 9, the minimum harm in the hiding of the

sensitive knowledge can be quantified as F 0 D = F D O S max , meaning that in an

optimal hiding scenario, the sanitized database D will contain all frequent itemsets

of D O except from the sensitive ones 2 . By only removing certain items from se-

lected transactions in D O , the inline algorithm ensures that database D will contain

no ghost itemsets, i.e. nonsensitive itemsets that did not appear among the frequent

ones in the original database. However, the exclusion of items from transactions

may lead to nonsensitive frequent itemsets being accidentally lost in the sanitized

database D. Thus, the inline algorithm has to take the appropriate measures to min-

imize the loss of frequent itemsets in D as a side-effect of the sanitization process.

As we presented in Chapter 13, Menon, et al. [47] employ the measure of accu-

racy in the optimization criterion of the produced CSP in order to derive the hiding

solution. Maximizing accuracy is equivalent to minimizing the number of transac-

tions from D O that need to be modified to accommodate the hiding of the sensitive

knowledge. The inline algorithm, on the other hand, takes a more radical approach

to quantifying the harm induced to the original database by penalizing the number

of actual item modifications rather than the number of transactions that are affected

by the hiding process. Indeed, a hiding algorithm may alter less transactions but

to a much higher degree than another one that probably alters more transactions

but to a much lower degree. A basic drawback of the accuracy measure is that it

penalizes all potential modifications of a frequent itemset, say for instance, both

'abcde' !? and 'abcde' ! 'abcd', to the same extent. Based on this mishap, a

1 This property along with a naïve approach for solving the association rule hiding problem, were

briefly discussed in Chapter 2 (Section 2.2.1).

2 We should point out that notation F 0 D refers to the optimal set of frequent itemsets that should

appear in the sanitized database D to facilitate exact knowledge hiding, while F D denotes the

actual set of frequent itemsets that can be mined from D.

Association Rule Hiding for Data Mining

Search WWH ::

Custom Search

Home