Database Reference
In-Depth Information
is mistakenly classified as rare, or the converse, where a rare itemset is claimed to
be frequent.
The Support Error ( μ ) metric reflects the average relative error (in percent) of
the reconstructed support values for those itemsets that are correctly identified to be
frequent. Denoting the number of frequent itemsets by
|
F
|
, the reconstructed support
by
sup and the actual support by sup , the support error is computed over all frequent
itemsets as
f F |
sup f
sup f
|
1
μ
=
100
|
F
|
sup f
The Identity Error ( σ ) metric, on the other hand, reflects the percentage error in
identifying frequent itemsets and has two components: σ + , indicating the percentage
of false positives, and σ indicating the percentage of false negatives. Denoting the
reconstructed set of frequent itemsets with R and the correct set of frequent itemsets
with F , these metrics are computed as follows
σ + = |
R
F
|
σ = |
F
R
|
100
100
|
F
|
|
F
|
Note that in some papers (e.g. [ 57 ]), the accuracy metrics are taken to be the worst-
case, rather than average-case, versions of the above errors.
2.2
Evolution of the Literature
From the database perspective, the field of privacy-preserving data mining was
catalyzed by the pioneering investigation of [ 6 ]. In that work, developing privacy-
preserving data classifiers by adding noise to the record values was proposed and
analyzed. This approach was extended in [ 3 ] and [ 29 ] to address a variety of subtle
privacy loopholes.
Concurrently, the research community also began to look into extending privacy-
preserving techniques to alternative mining patterns, such as association rules,
clustering, etc. For association rules, two main streams of literature emerged, as
mentioned earlier, one looking at providing input data privacy, and the other con-
sidering the protection of sensitive output rules (discussed in Sect. 3). An important
point to note here is that unlike the privacy-preserving classifier approaches, which
were based on adding a noise component to continuous-valued data, the privacy-
preserving techniques in ARM are based on probabilistic mapping from the domain
space to the range space, over categorical atttributes.
With regard to input data privacy, the early papers include [ 17 , 45 ], which proposed
the MASK algorithm and the Cut-and-Paste operators, respectively.
MASK In MASK [ 45 ], a simple probabilistic distortion of user data, employing
random numbers generated from a pre-defined distribution function, was proposed
and evaluated in the context of sparse boolean databases, such as those found in
Search WWH ::




Custom Search