Privacy-Preserving Data Mining: A Survey - Database Security: Applications and Trends


We note that at the end of the process, we only have a distribution
containing the behavior of X. Individual records are not available.
Furthermore, the distributions are available only along individual
dimensions. Therefore, new data mining algorithms need to be designed to
work with the uni-variate distributions rather than the individual records.
This can sometimes be a challenge, since many data mining algorithms are
inherently dependent on statistics which can only be extracted from either
the individual records or the multi-variate probability distributions
associated with the records. While the approach can certainly be extended to
multi-variate distributions, density estimation becomes inherently more
challenging [100] with increasing dimensionality. Even for modest
dimensionalities such as 7 to 10, density estimation becomes increasingly
inaccurate and falls prey to the curse of dimensionality.
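To make the point about aggregate-only recovery concrete, the following is a minimal sketch, assuming additive zero-mean Gaussian noise with a publicly known standard deviation. The function names (`perturb`, `estimate_moments`) and the synthetic data are hypothetical and chosen only for illustration. The sketch recovers the mean and variance of X along a single dimension from the perturbed values, using the fact that for independent noise the variance of the released values is the sum of the two variances. It does not recover any individual record, and it is not the full distribution reconstruction procedure.

```python
import random
import statistics

def perturb(values, noise_std, rng):
    """Add independent zero-mean Gaussian noise to each value (assumed noise model)."""
    return [v + rng.gauss(0.0, noise_std) for v in values]

def estimate_moments(perturbed, noise_std):
    """Estimate mean and variance of the original attribute from perturbed values.

    Relies on E[Z] = E[X] and Var(Z) = Var(X) + Var(Y) for independent,
    zero-mean noise Y with known standard deviation.
    """
    mean_x = statistics.fmean(perturbed)
    var_x = max(statistics.pvariance(perturbed) - noise_std ** 2, 0.0)
    return mean_x, var_x

rng = random.Random(0)
# Synthetic one-dimensional attribute, for illustration only.
original = [rng.uniform(20.0, 60.0) for _ in range(20000)]
released = perturb(original, noise_std=10.0, rng=rng)

est_mean, est_var = estimate_moments(released, noise_std=10.0)
print(f"estimated mean {est_mean:.2f}, estimated variance {est_var:.2f}")
```

The estimates describe the attribute as a whole. Any statistic that couples several attributes, such as a correlation, would need the joint distribution, which this per-dimension process does not provide.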

One key advantage of the randomization method is that it is relatively
simple, and does not require knowledge of the distribution of other records
in the data. This is not true of other methods such as k-anonymity, which
require knowledge of other records in the data. Therefore, the randomization
method can be implemented at data collection time, and does not require the
use of a trusted server containing all the original records in order to
perform the anonymization process. While this is a strength of the
randomization method, it also leads to some weaknesses, since it treats all
records equally irrespective of their local density. Therefore, outlier
records are more susceptible to adversarial attacks than records in denser
regions of the data [10]. In order to guard against this, one may need to
add noise to all the records more aggressively than most of them require.
This reduces the utility of the data for mining purposes.
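The two properties discussed above can be illustrated with a small sketch, assuming uniform additive noise on a single numeric attribute. The helper names (`randomize_at_source`, `candidate_count`) and the synthetic population are hypothetical. The first function perturbs a value using only that value, which is why the step can run at collection time. The second is a crude exposure proxy, not the measure used in [10]: it counts how many original values lie within the noise range of a released value, computed here with knowledge of the originals purely for illustration.

```python
import random

def randomize_at_source(value, half_width, rng):
    """Perturb one value locally; no other records are consulted."""
    return value + rng.uniform(-half_width, half_width)

def candidate_count(released_value, population, half_width):
    """Count original values that could have produced the released value."""
    return sum(1 for v in population if abs(released_value - v) <= half_width)

rng = random.Random(1)
half_width = 5.0

# Synthetic data: a dense cluster plus a single far-away outlier.
dense = [rng.gauss(40.0, 5.0) for _ in range(1000)]
outlier = 200.0
population = dense + [outlier]

released = [randomize_at_source(v, half_width, rng) for v in population]

# Average number of plausible originals for records in the dense region.
avg_dense_candidates = sum(
    candidate_count(z, population, half_width) for z in released[:-1]
) / len(dense)

# Number of plausible originals for the outlier's released value.
outlier_candidates = candidate_count(released[-1], population, half_width)

print(f"dense region: about {avg_dense_candidates:.0f} candidates per record")
print(f"outlier: {outlier_candidates} candidate(s)")
```

In this setup the outlier's released value is compatible with only one original record, while a record in the dense cluster is hidden among many. Widening the noise enough to cover the outlier would apply the same width to every record, which is the utility cost described above.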

The randomization method has been extended to a variety of data mining
problems. In [2], it was discussed how to use the approach for
classification. A number of other techniques [124, 126] have also been
proposed which seem to work well over a variety of different classifiers.
Techniques have also been proposed for privacy-preserving methods of
improving the effectiveness of classifiers. For example, the work in [47]
proposes methods for privacy-preserving boosting of classifiers. Methods for
privacy-preserving mining of association rules have been proposed in
[44, 95]. The problem of association rules is especially challenging because
of the discrete nature of the attributes corresponding to presence or
absence of items. In order to deal with this issue, the randomization
technique needs to be modified slightly. Instead of adding quantitative
noise, random items are dropped or included with a certain probability. The
perturbed transactions are then used for aggregate association rule mining.
This technique has been shown to be extremely effective in [44]. The
randomization approach has also been extended to other applications such as
OLAP [3] and SVD-based collaborative filtering [91].
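The idea of dropping or including items with a certain probability can be sketched as follows. This is a simplified illustration, assuming a symmetric scheme in which the presence of each item is flipped independently with a known probability. It is not necessarily the operator used in [44] or [95], and the names (`randomize_transaction`, `estimate_support`) and the synthetic transactions are hypothetical. If s is the true support of an item and p the flip probability, the observed support is s(1 - p) + (1 - s)p, which can be inverted as long as p is not 0.5.

```python
import random

def randomize_transaction(items, universe, flip_prob, rng):
    """Flip the presence or absence of each item independently with flip_prob."""
    perturbed = set()
    for item in universe:
        present = item in items
        if rng.random() < flip_prob:
            present = not present  # drop a present item or include an absent one
        if present:
            perturbed.add(item)
    return perturbed

def estimate_support(perturbed_transactions, item, flip_prob):
    """Invert observed = s * (1 - p) + (1 - s) * p to estimate the true support s."""
    if flip_prob == 0.5:
        raise ValueError("flip_prob of 0.5 destroys all information about support")
    observed = sum(1 for t in perturbed_transactions if item in t)
    observed /= len(perturbed_transactions)
    return (observed - flip_prob) / (1.0 - 2.0 * flip_prob)

rng = random.Random(7)
universe = ["beer", "bread", "eggs", "milk"]  # hypothetical item catalogue

# Synthetic transactions: each item appears independently with its own rate.
rates = {"beer": 0.1, "bread": 0.3, "eggs": 0.5, "milk": 0.6}
transactions = [
    {item for item in universe if rng.random() < rates[item]}
    for _ in range(20000)
]

flip_prob = 0.2
perturbed = [randomize_transaction(t, universe, flip_prob, rng) for t in transactions]

true_support = sum(1 for t in transactions if "bread" in t) / len(transactions)
est_support = estimate_support(perturbed, "bread", flip_prob)
print(f"true support {true_support:.3f}, estimated support {est_support:.3f}")
```

The sketch only estimates single-item supports. Estimating the support of larger itemsets from perturbed transactions requires inverting the joint effect of the flips on all items in the itemset, which is more involved.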
