Privacy Issues in Association Rule Mining - Frequent Pattern Mining


record of U may capture the items that were purchased together by an individual

from a supermarket (e.g., u_1 = {bread, milk, sugar}). A similar representation that is

usually adopted by ARH algorithms is that of a boolean matrix, where each column

corresponds to an item from the domain of items I and each row is a transaction.
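The boolean-matrix view can be sketched in a few lines of Python. This is only an illustration: the item domain, the sample transactions, and the helper name `to_boolean_matrix` are assumptions made for the example and do not come from the text.

```python
def to_boolean_matrix(transactions, items):
    """Return one 0/1 row per transaction and one column per item in `items`."""
    return [[1 if item in t else 0 for item in items] for t in transactions]


# Hypothetical item domain I and dataset U (illustrative values only).
items = ["bread", "milk", "sugar", "eggs"]
transactions = [
    {"bread", "milk", "sugar"},  # u_1 from the running example
    {"milk", "eggs"},
    {"bread"},
]

matrix = to_boolean_matrix(transactions, items)
for row in matrix:
    print(row)
```

Each row has exactly as many entries as there are items in the domain, which is what makes the matrix form convenient for algorithms that flip or mask individual cells.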

In this representation, a transaction of U has length |I| and has 1's in items that are

associated with it (e.g., purchased items) and 0's in the rest of the items.

Knowledge hiding, in the context of ARM, aims at sanitizing (transforming) the

original dataset in a way that the following goals are accomplished to the largest

possible extent:

a) Sensitive rules are concealed. No rule that the data owner considers sensitive can be

revealed from the sanitized dataset when the dataset is mined at the pre-specified

thresholds of confidence and support (or at any value higher than these thresholds).

b) Frequent non-sensitive rules are preserved. All the non-sensitive frequent rules

can be successfully mined from the sanitized database at pre-specified thresholds

of confidence and support.

c) Ghost rules are not generated. No rule that was not mined from the original

dataset as frequent can be discovered from the sanitized database, when mining

this database at pre-specified thresholds of confidence and support.

d) Dataset distortion is minimal. The sanitized dataset is “as similar as possible”

to the original dataset, i.e., the number of data items that are affected by the hiding

process is kept to a minimum.

The first goal requires sensitive rules to disappear. The second goal simply states

that there should be no lost rules in the sanitized dataset. The third goal says that no

false rules should be produced as a side-effect of the sanitization process. The fourth

goal requires that the hiding process incurs minimal distortion to the original dataset.

Generally speaking, in the typical hiding scenario, the sanitization process has

to be accomplished in a way that minimally affects the original dataset, preserves the

general patterns and trends, and successfully conceals all the sensitive knowledge.
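The four goals can be read as four quantities to check after sanitization. The sketch below is one possible way to express them, assuming the rules have already been mined from both datasets at the same thresholds. The function name `evaluate_sanitization`, the string encoding of rules, and the sample data are all hypothetical.

```python
def evaluate_sanitization(original_rules, sanitized_rules, sensitive_rules,
                          original, sanitized):
    """Summarize how well a sanitized dataset meets goals (a) to (d).

    Rules are given as sets of hashable identifiers; datasets are boolean
    matrices (lists of equal-length 0/1 rows) of identical shape.
    """
    non_sensitive = original_rules - sensitive_rules
    return {
        # (a) sensitive rules that can still be mined (ideally empty)
        "exposed_sensitive": sensitive_rules & sanitized_rules,
        # (b) non-sensitive rules that disappeared (ideally empty)
        "lost_rules": non_sensitive - sanitized_rules,
        # (c) rules that appear only after sanitization (ideally empty)
        "ghost_rules": sanitized_rules - original_rules,
        # (d) number of matrix cells that differ (ideally small)
        "changed_cells": sum(
            a != b
            for row_o, row_s in zip(original, sanitized)
            for a, b in zip(row_o, row_s)
        ),
    }


# Illustrative inputs only; no real mining is performed here.
report = evaluate_sanitization(
    original_rules={"bread->milk", "milk->sugar", "bread->sugar"},
    sanitized_rules={"milk->sugar", "sugar->eggs"},
    sensitive_rules={"bread->milk"},
    original=[[1, 1, 1], [1, 1, 0]],
    sanitized=[[1, 0, 1], [1, 1, 0]],
)
print(report)
```

In this made-up case the sensitive rule is hidden, but one non-sensitive rule is lost and one ghost rule appears, at the cost of a single changed cell.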

3.2 Taxonomy of ARH Algorithms

In this section, we present a taxonomy of frequent itemset and association rule hiding

algorithms. To classify the various algorithms, we use a set of orthogonal dimensions.

As a first dimension, we consider whether the hiding algorithm uses the support or

the confidence of the rule to drive the hiding process. In this way we separate the

hiding algorithms into support-based and confidence-based.
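To make the two measures concrete, the following sketch uses the standard definitions: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule X -> Y is the support of X ∪ Y divided by the support of X. The helper names and the toy dataset are assumptions made for illustration, and the single item deletion shown is only one simple way of lowering the measures, not a specific algorithm from the taxonomy.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(X ∪ Y) / support(X); defined as 0.0 when X never occurs."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical dataset of four transactions.
data = [{"bread", "milk"}, {"bread", "milk"}, {"bread"}, {"milk"}]
supp_before = support(data, {"bread", "milk"})
conf_before = confidence(data, {"bread"}, {"milk"})

# Removing "milk" from one supporting transaction lowers both measures
# of the rule bread -> milk.
modified = [{"bread"}, {"bread", "milk"}, {"bread"}, {"milk"}]
supp_after = support(modified, {"bread", "milk"})
conf_after = confidence(modified, {"bread"}, {"milk"})
```

A rule stops being reported once either measure falls below its mining threshold, which is why an algorithm can target one measure or the other.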

The second dimension in the classification is related to the modification in the raw

data that is caused by the hiding algorithm. The two forms of modification comprise

the distortion and the blocking of the original values. Distortion is the process of

replacing 1's by 0's and 0's by 1's, while blocking refers to replacing original values

by question marks (unknowns) to confuse adversaries about the actual value.
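The difference between the two forms of modification can be sketched on a small boolean matrix. The functions `distort`, `block`, and `support_count_bounds`, as well as the sample matrix, are hypothetical names and values chosen for this example. The bounds computation illustrates one consequence of blocking, under the assumption that an unknown cell may stand for either 0 or 1.

```python
UNKNOWN = "?"


def distort(matrix, row, col):
    """Return a copy of `matrix` with the chosen cell flipped (1 -> 0, 0 -> 1)."""
    out = [list(r) for r in matrix]
    out[row][col] = 1 - out[row][col]
    return out


def block(matrix, row, col):
    """Return a copy of `matrix` with the chosen cell replaced by an unknown."""
    out = [list(r) for r in matrix]
    out[row][col] = UNKNOWN
    return out


def support_count_bounds(matrix, cols):
    """Minimum and maximum number of rows that may support the given columns."""
    low = sum(all(r[c] == 1 for c in cols) for r in matrix)
    high = sum(all(r[c] in (1, UNKNOWN) for c in cols) for r in matrix)
    return low, high


matrix = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
distorted = distort(matrix, 0, 1)
blocked = block(matrix, 0, 1)
bounds = support_count_bounds(blocked, [0, 1])
```

With distortion the support count of the first two columns drops from 2 to 1, whereas with blocking an observer can only place it somewhere between 1 and 2.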
