Privacy-Preserving Data Mining: A Survey - Database Security: Applications and Trends


2.2 Adversarial Attacks on Randomization

In the earlier section on privacy quantification, we illustrated an example in

which the reconstructed distribution on the data can be used in order to reduce

the privacy of the underlying data record. In general, a systematic approach

can be used to do this in multi-dimensional data sets with the use of spectral

filtering or PCA-based techniques [50, 62]. The broad idea in techniques such

as PCA [50] is that the correlation structure in the original data can be

estimated fairly accurately (in larger data sets) even after noise addition. Once

the broad correlation structure in the data has been determined, one can then

try to remove the noise in the data in such a way that it fits the aggregate

correlation structure of the data. It has been shown that such techniques can

reduce the privacy of the perturbation process significantly since the noise

removal results in values which are fairly close to their original values [50, 62].

Some other discussions on limiting breaches of privacy in the randomization

method may be found in [43].
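The PCA-based noise removal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure of [50] or [62]: the noise is assumed to be independent Gaussian with known variance, the synthetic data is assumed to lie in a low-dimensional subspace, and the helper name pca_denoise and all parameter values are hypothetical.

```python
import numpy as np

def pca_denoise(Z, noise_var, k):
    """Project perturbed records onto the top-k principal directions.

    Assumes independent additive noise, so that
    Cov(Z) = Cov(X) + noise_var * I and the correlation structure
    of X can be estimated from the perturbed data alone.
    """
    mean = Z.mean(axis=0)
    Zc = Z - mean
    cov_z = np.cov(Zc, rowvar=False)
    # Remove the known noise contribution from the diagonal.
    cov_x = cov_z - noise_var * np.eye(Z.shape[1])
    eigvals, eigvecs = np.linalg.eigh(cov_x)
    # Keep the k directions with the largest estimated variance.
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Zc @ top @ top.T + mean

rng = np.random.default_rng(0)
n, d, k = 5000, 10, 2              # illustrative sizes (assumptions)
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latent @ mixing                # strongly correlated original data
noise_std = 1.0
Z = X + rng.normal(scale=noise_std, size=(n, d))   # randomized release

X_hat = pca_denoise(Z, noise_std**2, k)
err_before = np.mean((Z - X) ** 2)      # error of the perturbed values
err_after = np.mean((X_hat - X) ** 2)   # error after noise removal
print(err_before, err_after)
```

In this sketch the projection discards the noise components that fall outside the estimated correlation subspace, which is why the filtered values end up closer to the originals than the released ones.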

A second kind of adversarial attack is with the use of public information.

Consider a record $X = (x_1 \ldots x_d)$, which is perturbed to $Z = (z_1 \ldots z_d)$.

Then, since the distribution of the perturbations is known, we can try to use

a maximum likelihood fit of the potential perturbation of $Z$ to a public record.

Consider the publicly available record $W = (w_1 \ldots w_d)$. Then, the potential

perturbation of $Z$ with respect to $W$ is given by $(Z - W) = (z_1 - w_1 \ldots z_d - w_d)$.

Each of these values $(z_i - w_i)$ should fit the distribution $f_Y(y)$. The

corresponding log-likelihood fit is given by $\sum_{i=1}^{d} \log(f_Y(z_i - w_i))$. The higher the

log-likelihood fit, the greater the probability that the record W corresponds

to X . If it is known that the public data set always includes X , then the

maximum likelihood fit can provide a high degree of certainty in identifying

the correct record, especially in cases where d is large. We will discuss this

issue in greater detail in a later section.
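The maximum likelihood fit against public records can be illustrated with a short sketch. It assumes that the perturbation distribution f_Y is a zero-mean Gaussian with known standard deviation and that the public data set contains the original record; the function names, the synthetic data, and the parameter values are hypothetical.

```python
import numpy as np

def log_likelihood_fit(z, w, noise_std):
    """Sum over i of log f_Y(z_i - w_i), with f_Y assumed Gaussian(0, noise_std^2)."""
    diff = z - w
    log_density = (-0.5 * np.log(2 * np.pi * noise_std**2)
                   - diff**2 / (2 * noise_std**2))
    return float(np.sum(log_density))

def best_match(z, public, noise_std):
    """Return the index of the public record with the highest log-likelihood fit."""
    scores = np.array([log_likelihood_fit(z, w, noise_std) for w in public])
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(1)
n, d = 200, 100                     # illustrative sizes; d is deliberately large
public = rng.normal(size=(n, d))    # synthetic public data set (assumption)
true_index = 17                     # the record X that gets perturbed
noise_std = 1.0
z = public[true_index] + rng.normal(scale=noise_std, size=d)

guess, scores = best_match(z, public, noise_std)
print(guess == true_index)
```

Each additional dimension contributes another term to the sum, so the gap between the fit of the correct record and the fits of the other records tends to widen as d grows.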

2.3 Randomization Methods for Data Streams

The randomization approach is particularly well suited to privacy-preserving

data mining of streams, since the noise added to a given record is independent

of the rest of the data. However, streams provide a particularly vulnerable

target for adversarial attacks with the use of PCA-based techniques

[50] because of the large volume of the data available for analysis. In [73],

an interesting technique for randomization has been proposed which uses the

auto-correlations in different time series while deciding the noise to be added

to any particular value. It has been shown in [73] that such an approach

is more robust since the noise correlates with the stream behavior, and it is

more difficult to create effective adversarial attacks with the use of correlation

analysis techniques.
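The intuition behind correlated noise can be illustrated with a toy experiment. This is not the algorithm of [73], whose details are not given here; it is only a sketch under stated assumptions: the stream is a slowly varying sine wave, the adversary uses a simple moving-average filter, and the correlated noise is an AR(1) process with the same variance as the independent noise. All names and parameter values are hypothetical.

```python
import numpy as np

def ar1_noise(n, phi, std, rng):
    """AR(1) noise with lag-1 correlation phi and stationary standard deviation std."""
    eps = rng.normal(scale=std * np.sqrt(1.0 - phi**2), size=n)
    y = np.empty(n)
    y[0] = rng.normal(scale=std)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + eps[t]
    return y

def moving_average_attack(z, window):
    """Adversary's estimate of the stream: a centered moving average."""
    kernel = np.ones(window) / window
    return np.convolve(z, kernel, mode="valid")

rng = np.random.default_rng(2)
n, window = 5000, 21
half = window // 2
t = np.arange(n)
x = 5.0 * np.sin(2 * np.pi * t / 200)      # slowly varying stream (assumption)

white = rng.normal(scale=1.0, size=n)                   # independent noise
correlated = ar1_noise(n, phi=0.95, std=1.0, rng=rng)   # autocorrelated noise

x_mid = x[half:n - half]                   # samples aligned with the "valid" output
err_white = np.mean((moving_average_attack(x + white, window) - x_mid) ** 2)
err_corr = np.mean((moving_average_attack(x + correlated, window) - x_mid) ** 2)
print(err_white, err_corr)
```

Both noise sequences have the same variance, but averaging over a window cancels most of the independent noise while leaving much of the autocorrelated noise in place, so the adversary's reconstruction error stays higher in the second case.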
