We note that at the end of the process, we only have a distribution
containing the behavior of X. Individual records are not available. Furthermore,
the distributions are available only along individual dimensions. Therefore,
new data mining algorithms need to be designed to work with the uni-variate
distributions rather than the individual records. This can sometimes be a
challenge, since many data mining algorithms are inherently dependent on
statistics which can only be extracted from either the individual records or
the multi-variate probability distributions associated with the records. While
the approach can certainly be extended to multi-variate distributions, density
estimation becomes inherently more challenging [100] with increasing dimen-
sionalities. For even modest dimensionalities such as 7 to 10, the process of
density estimation becomes increasingly inaccurate, and falls prey to the curse
of dimensionality.
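
The sparsity behind this effect can be illustrated with a small sketch. It is only an illustration under assumed settings (uniform data on the unit cube, 10 histogram bins per dimension, 10,000 points; the function name occupied_fraction is hypothetical) and is not the density estimation procedure of [100]. It counts what fraction of histogram cells receive at least one point as the dimensionality grows.

```python
import random


def occupied_fraction(n_points, n_dims, bins_per_dim=10, seed=0):
    """Fraction of histogram cells that contain at least one sample point."""
    rng = random.Random(seed)
    cells = set()
    for _ in range(n_points):
        # Map a uniform point in [0, 1)^d to the index of its histogram cell.
        cell = tuple(int(rng.random() * bins_per_dim) for _ in range(n_dims))
        cells.add(cell)
    return len(cells) / bins_per_dim ** n_dims


# Same sample size, increasing dimensionality (assumed values for illustration).
for d in (1, 2, 4, 7):
    print(d, occupied_fraction(10_000, d))
```

With a fixed number of points, at most 10,000 of the 10^7 cells can be occupied at 7 dimensions, so a histogram-style estimate has no data at all for almost every cell.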
One key advantage of the randomization method is that it is relatively
simple, and does not require knowledge of the distribution of other records
in the data. This is not true of other methods such as k-anonymity, which
require knowledge of other records in the data. Therefore, the randomization
method can be implemented at data collection time, and does not require
the use of a trusted server containing all the original records in order to per-
form the anonymization process. While this is a strength of the randomization
method, it also leads to some weaknesses, since it treats all records equally
irrespective of their local density. Therefore, outlier records are more suscep-
tible to adversarial attacks as compared to records in more dense regions in
the data [10]. In order to guard against this, one may need to add more noise
to all the records in the data than most of them require. This reduces
the utility of the data for mining purposes.
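
The collection-time property can be sketched as follows. This is a minimal illustration, assuming zero-mean Gaussian noise with a publicly known standard deviation (NOISE_STD, the function name randomize, and the synthetic data are all assumptions made for the example, not taken from the text). Each value is perturbed independently, with no access to any other record, and only aggregate statistics are recovered afterwards.

```python
import random
import statistics

NOISE_STD = 5.0  # assumed noise scale, known to both collector and analyst


def randomize(value, rng):
    """Perturb one value at collection time, using no other records."""
    return value + rng.gauss(0.0, NOISE_STD)


rng = random.Random(42)
# Synthetic "original" values; in practice these never leave the data owner.
original = [rng.gauss(50.0, 10.0) for _ in range(20_000)]
perturbed = [randomize(x, rng) for x in original]

# Zero-mean noise leaves the mean unchanged in expectation, and independent
# noise adds its variance to the variance of the original values.
est_mean = statistics.fmean(perturbed)
est_var = statistics.pvariance(perturbed) - NOISE_STD ** 2
print(est_mean, est_var)
```

Note that the same noise scale is applied to every record, including outliers, which is the weakness discussed above.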
The randomization method has been extended to a variety of data mining
problems. The work in [2] discusses how to use the approach for classification.
A number of other techniques [124, 126] have also been proposed, which
seem to work well over a variety of different classifiers. Techniques have also
been proposed for privacy-preserving methods of improving the effectiveness
of classifiers. For example, the work in [47] proposes methods for privacy-
preserving boosting of classifiers. Methods for privacy-preserving mining of
association rules have been proposed in [44, 95]. The problem of association
rules is especially challenging because of the discrete nature of the attributes
corresponding to presence or absence of items. In order to deal with this issue,
the randomization technique needs to be modified slightly. Instead of adding
quantitative noise, random items are dropped or included with a certain prob-
ability. The perturbed transactions are then used for aggregate association
rule mining. This technique has been shown to be extremely effective in [44]. The
randomization approach has also been extended to other applications such as
OLAP [3] and SVD-based collaborative filtering [91].
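
The idea of dropping or including items at random can be sketched with a simple randomized-response style scheme. This is an assumption-laden illustration and not the specific operators of [44] or [95]: the item universe ITEMS, the probability KEEP_PROB, and both function names are hypothetical. Each item's presence bit is reported truthfully with probability KEEP_PROB and flipped otherwise, and the support of a single item is then estimated from the perturbed transactions.

```python
import random

ITEMS = ["bread", "milk", "beer"]  # hypothetical item universe
KEEP_PROB = 0.8  # assumed probability that a presence bit is reported truthfully


def randomize_transaction(transaction, rng):
    """Flip each item's presence bit independently with probability 1 - KEEP_PROB."""
    out = set()
    for item in ITEMS:
        present = item in transaction
        if rng.random() >= KEEP_PROB:
            present = not present  # a present item is dropped, an absent one included
        if present:
            out.add(item)
    return out


def estimate_support(perturbed, item):
    """Invert observed = p*s + (1 - p)*(1 - s) to estimate the true support s."""
    observed = sum(item in t for t in perturbed) / len(perturbed)
    return (observed - (1 - KEEP_PROB)) / (2 * KEEP_PROB - 1)


rng = random.Random(7)
# Synthetic transactions: bread always, milk in roughly 40% of them, beer never.
true_data = [{"bread"} | ({"milk"} if rng.random() < 0.4 else set())
             for _ in range(50_000)]
perturbed = [randomize_transaction(t, rng) for t in true_data]

true_support = sum("milk" in t for t in true_data) / len(true_data)
print(true_support, estimate_support(perturbed, "milk"))
```

Only aggregate supports are recovered here; any individual perturbed transaction remains unreliable as evidence of what the original transaction contained.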