Privacy-Preserving Data Mining Techniques: Survey and Challenges - Discrimination and Privacy in the Information Society

distribution. Individual attribute values therefore bear no similarity to the original
values, e.g. salary values may become negative numbers. The distribution of such a
randomized attribute is of course totally different from the original distribution of
a_j, but Agrawal and Srikant (2000) show how the distributional properties of the
original attribute can be reconstructed from the randomized distribution so that
meaningful data mining operations (e.g. classification or association rule mining) can
be performed. This is illustrated in Fig. 11.2. The results of these data mining
operations are close to the results obtained on the original data. The approach of
Agrawal and Srikant (2000) has been shown to be prone to an attack using sophisticated
control theory methods (Kargupta, Datta et al. 2003). Another perturbative approach
introduces multiplicative rather than additive noise in the data (Kun, Kargupta et al.
2006), with privacy guarantees stronger than those given by additive noise.

Fig. 11.2 The effect of perturbation on the distribution of the attribute being
perturbed, applied to protect its values from disclosure (from Aggarwal and Yu 2008).
The curve labeled “randomized” has a distribution very different from the original one.
Data mining is performed on the reconstructed distribution, obtained using the
algorithm described in the paper. The reconstructed distribution is close to the
original, and therefore the results of data mining are close to what they would have
been on the original, confidential data.
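
The additive randomization and reconstruction discussed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes Gaussian noise with a publicly known standard deviation, it discretizes the attribute into equal-width bins, and the names perturb, reconstruct, midpoints and the synthetic salary data are all hypothetical. The update inside reconstruct is a binned version of the iterative Bayesian estimate: each perturbed value distributes one unit of posterior mass over the bins, and the new estimate is the average of those posteriors.

```python
import math
import random


def perturb(values, noise_sd, rng):
    """Release x + y, where y is independent zero-mean Gaussian noise (assumed)."""
    return [x + rng.gauss(0.0, noise_sd) for x in values]


def gaussian_pdf(z, sd):
    """Density of the assumed noise distribution at z."""
    return math.exp(-0.5 * (z / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def reconstruct(perturbed, noise_sd, midpoints, iterations=50):
    """Estimate the probability of each bin of the original attribute.

    Only the perturbed values and the noise distribution are used;
    the original values are never consulted.
    """
    k = len(midpoints)
    # likelihood[i][a]: density of seeing perturbed[i] if the true value
    # were the midpoint of bin a.
    likelihood = [[gaussian_pdf(w - m, noise_sd) for m in midpoints]
                  for w in perturbed]
    estimate = [1.0 / k] * k  # start from a uniform guess
    for _ in range(iterations):
        updated = [0.0] * k
        for row in likelihood:
            weights = [l * p for l, p in zip(row, estimate)]
            total = sum(weights)
            if total == 0.0:  # value too far from every bin to be informative
                continue
            for a in range(k):
                updated[a] += weights[a] / total
        norm = sum(updated)
        estimate = [u / norm for u in updated]
    return estimate


rng = random.Random(7)
# Synthetic salary attribute with two clusters (illustrative data only).
original = ([rng.gauss(30_000, 3_000) for _ in range(500)]
            + [rng.gauss(70_000, 3_000) for _ in range(500)])
noise_sd = 10_000
perturbed = perturb(original, noise_sd, rng)
midpoints = [2_500 + 5_000 * a for a in range(20)]  # bins covering 0..100,000
reconstructed = reconstruct(perturbed, noise_sd, midpoints)

for m, p in zip(midpoints, reconstructed):
    print(f"{m:>7} {p:.3f}")
```

The miner in this sketch sees only perturbed and noise_sd. How closely the printed estimate follows the original histogram depends on the sample size, the bin width, the noise level and the number of iterations, so these parameters should be treated as tunable assumptions rather than recommended values.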

Perturbative methods have the disadvantage of modifying the original data, which can be
difficult to accept in certain classes of applications, e.g. in working with medical
data.

A different kind of perturbative approach is known as rank swapping (Nin, Herranz et
al. 2008). The main idea is to swap values of a given attribute among records in a
dataset. The swapping is controlled by the distance between the swapped values: values
that are close are more likely to be swapped. The advantage of this approach is that,
unlike with the noise-injecting approaches described above, the entire set of values of
a given attribute, and hence its distribution, is preserved. The disadvantage is that
potentially implicit relationships between values of different attributes can be
broken.
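
A minimal sketch of rank swapping for a single numeric attribute is given below. It assumes one common formulation, in which a value may only be exchanged with a value whose rank lies within p percent of the number of records; this hard window stands in for the "closer values are more likely to be swapped" rule and is not necessarily the exact procedure of the cited work. The function name rank_swap, the parameter p and the sample salaries are hypothetical.

```python
import random


def rank_swap(values, p=10, seed=None):
    """Swap each value with a partner at most p percent of the ranks away.

    Returns a new list in the same record order as `values`.
    """
    rng = random.Random(seed)
    n = len(values)
    # order[r] is the record index holding the value of rank r.
    order = sorted(range(n), key=lambda i: values[i])
    window = max(1, n * p // 100)  # maximum rank distance for a swap
    swapped = list(values)
    used = [False] * n  # ranks that have already taken part in a swap
    for r in range(n):
        if used[r]:
            continue
        partners = [s for s in range(r + 1, min(n, r + window + 1))
                    if not used[s]]
        if not partners:
            continue  # no free partner nearby: the value stays in place
        s = rng.choice(partners)
        i, j = order[r], order[s]
        swapped[i], swapped[j] = values[j], values[i]
        used[r] = used[s] = True
    return swapped


salaries = [31_000, 28_500, 45_000, 52_000, 39_000,
            61_000, 47_500, 33_000, 58_000, 41_000]
swapped = rank_swap(salaries, p=20, seed=3)
print(swapped)
print(sorted(swapped) == sorted(salaries))  # the set of values is unchanged
```

Because values are only exchanged between records, the multiset of salaries, and therefore every statistic of this single attribute, is unchanged. The pairing between a record's salary and its other attributes is what gets disturbed, which is the disadvantage noted above.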
