Hybrid Algorithm - Association Rule Hiding for Data Mining

Database Reference

In-Depth Information

16.1.2 A Running Example

To shed light into the workings of the hybrid algorithm, in what follows we borrow

the running example of [26]. Suppose that we are provided with database D O of

Table 16.1 . Applying frequent itemset mining in D O using mfreq = 0:3, leads to the

set of large itemsets F D O appearing in the upper part of Table 16.2. Among these

itemsets, S =fe; ae; bcg denotes the sensitive knowledge that has to be protected.

The hybrid hiding algorithm aims at the creation of a database extension D X to D O

(see Table 16.1 ) that allows the hiding of the sensitive knowledge, while keeping the

nonsensitive patterns frequent in the sanitized outcome.

Table 16.1 summarizes the target of the hybrid hiding algorithm. The union of

the two datasets, D O and D X , corresponds to the sanitized outcome D that can

be safely released. Thus, the primary goal of the hiding algorithm is to construct

the privacy aware extension D X of D O such that (i) it contains the least amount

of transactions that are necessary to ensure the proper hiding of all the sensitive

knowledge in D O , and (ii) it introduces no side-effects to the hiding process. As one

can observe in the lower part of Table 16.2, all the sensitive itemsets of D O along

with their supersets are infrequent in D (shown under the dashed line), while all

the nonsensitive itemsets of D O remain frequent. Since D O is extended, in order to

ensure that the nonsensitive patterns will remain frequent in D, the hiding algorithm

needs to appropriately increase their support in the sanitized database.

16.2 Main Issues Pertaining to the Hiding Methodology

The hybrid hiding methodology produces a sanitized database D that corresponds

to a mixture of the original transactions in D O and a set of synthetic transactions,

artificially created to prohibit the leakage of the sensitive knowledge. For security

reasons, all the transactions in D are assumed to be randomly ordered so that it is

difficult for an adversary to distinguish between the real ones and those that were

added by the hiding algorithm to secure the sensitive knowledge. There are several

issues of major importance, involving the hiding methodology, that need to be ex-

amined. To continue, let P denote the size of database D, N is the size of database

D O , and Q is the size of the extension D X .

16.2.1 Size of Database Extension

Since database D O is extended by D X to construct database D, an initial and very

important step in the hiding process is the computation of the size of D X (i.e., the

necessary amount of transactions that need to be added to those of D O to facilitate

the hiding of the sensitive knowledge). A lower bound on this value can be estab-

lished based on the sensitive itemset in S which has the highest support (breaking

Search WWH ::

Custom Search

Home