Privacy Issues in Association Rule Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

is mistakenly classified as rare, or the converse, where a rare itemset is claimed to

be frequent.

The Support Error ( μ ) metric reflects the average relative error (in percent) of

the reconstructed support values for those itemsets that are correctly identified to be

frequent. Denoting the number of frequent itemsets by

|

F

|

, the reconstructed support

by

sup and the actual support by sup , the support error is computed over all frequent

itemsets as

f ∈ F |

sup f −

sup f

|

1

μ

=

∗

100

|

F

|

sup f

The Identity Error ( σ ) metric, on the other hand, reflects the percentage error in

identifying frequent itemsets and has two components: σ + , indicating the percentage

of false positives, and σ − indicating the percentage of false negatives. Denoting the

reconstructed set of frequent itemsets with R and the correct set of frequent itemsets

with F , these metrics are computed as follows

σ + = |

R

−

F

|

σ − = |

F

−

R

|

∗

100

∗

100

|

F

|

F

|

Note that in some papers (e.g. [ 57 ]), the accuracy metrics are taken to be the worst-

case, rather than average-case, versions of the above errors.

2.2

Evolution of the Literature

From the database perspective, the field of privacy-preserving data mining was

catalyzed by the pioneering investigation of [ 6 ]. In that work, developing privacy-

preserving data classifiers by adding noise to the record values was proposed and

analyzed. This approach was extended in [ 3 ] and [ 29 ] to address a variety of subtle

privacy loopholes.

Concurrently, the research community also began to look into extending privacy-

preserving techniques to alternative mining patterns, such as association rules,

clustering, etc. For association rules, two main streams of literature emerged, as

mentioned earlier, one looking at providing input data privacy, and the other con-

sidering the protection of sensitive output rules (discussed in Sect. 3). An important

point to note here is that unlike the privacy-preserving classifier approaches, which

were based on adding a noise component to continuous-valued data, the privacy-

preserving techniques in ARM are based on probabilistic mapping from the domain

space to the range space, over categorical atttributes.

With regard to input data privacy, the early papers include [ 17 , 45 ], which proposed

the MASK algorithm and the Cut-and-Paste operators, respectively.

MASK In MASK [ 45 ], a simple probabilistic distortion of user data, employing

random numbers generated from a pre-defined distribution function, was proposed

and evaluated in the context of sparse boolean databases, such as those found in

Frequent Pattern Mining

Search WWH ::

Custom Search

Home