(a result of data generalization) satisfies t-closeness if the distance between the
distribution of a sensitive attribute in this cluster and the distribution of this
attribute in the whole table T is no more than a threshold t. In that manner,
t-closeness may, in principle, prevent discrimination by making it impossible to
assert negative inferences about the sensitive attribute based on cluster membership
that would be stronger than the inferences available for the entire table
(the whole population). It is clear, however, that requiring t-closeness imposes a
very strong constraint on the generalization process, resulting in potentially very
significant distortion of the data, thereby decreasing the quality of the data (and of any
model obtained from it) unacceptably.
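To make the definition concrete, the sketch below checks t-closeness for one cluster (equivalence class). t-closeness is usually instantiated with the Earth Mover's Distance; the sketch substitutes the simpler total variation distance as a stand-in, and the attribute values, threshold, and function names are illustrative only.

```python
from collections import Counter

def distribution(values):
    """Relative frequency of each value of the sensitive attribute."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

def variational_distance(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

def satisfies_t_closeness(cluster_values, table_values, t):
    """True if the cluster's sensitive-attribute distribution is within t of the table's."""
    return variational_distance(distribution(cluster_values),
                                distribution(table_values)) <= t

# Hypothetical data: 'disease' is the sensitive attribute.
table = ["flu", "flu", "cancer", "flu", "hiv", "flu", "cancer", "flu"]
cluster = ["cancer", "hiv", "cancer"]  # one equivalence class after generalization
print(satisfies_t_closeness(cluster, table, t=0.3))  # False: this cluster is too skewed
```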
It is worth observing that the attack model behind data k-anonymity is somewhat
unrealistic. It assumes that the attacker has total knowledge of all attribute
values for a given instance, which will normally not be the case. Starting
with this observation, more realistic models have been proposed. For instance, in
(Mohammed, Fung et al. 2009) the attack model assumes that the attacker's knowledge
is limited to L quasi-identifiers, and the k-anonymization is limited to those
identifiers.
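A minimal illustration of such a relaxed requirement, under the assumption that the attacker knows at most L quasi-identifier values, is to require that every combination of values over any subset of at most L quasi-identifiers be shared by at least k records. This is only a sketch of the idea, not the exact definition used in the cited work; the record layout and names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def satisfies_limited_k_anonymity(records, quasi_identifiers, L, k):
    """Check that any attacker knowing at most L quasi-identifier values
    still matches at least k records."""
    for size in range(1, L + 1):
        for qid_subset in combinations(quasi_identifiers, size):
            counts = Counter(tuple(r[q] for q in qid_subset) for r in records)
            if any(c < k for c in counts.values()):
                return False
    return True

# Hypothetical generalized records with three quasi-identifiers.
records = [
    {"age": "30-39", "zip": "021**", "sex": "F"},
    {"age": "30-39", "zip": "021**", "sex": "F"},
    {"age": "30-39", "zip": "021**", "sex": "M"},
    {"age": "30-39", "zip": "021**", "sex": "M"},
]
print(satisfies_limited_k_anonymity(records, ["age", "zip", "sex"], L=2, k=2))  # True
```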
k-anonymization is often the method of choice in data publishing, particularly
for medical data. The reason is that, unlike the perturbative methods discussed
in the next section, the approach does not distort the data: even the generalized
data is "true", i.e. it represents true (even though possibly imprecise) statements
about the original data.
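The following toy sketch illustrates what such a truth-preserving generalization might look like, assuming decade-wide age bands and ZIP-code prefixes; the scheme and field names are purely illustrative and not taken from any cited work.

```python
def generalize_record(record):
    """Coarsen the quasi-identifiers; the result is less precise but still true."""
    decade = (record["age"] // 10) * 10
    age_band = f"{decade}-{decade + 9}"
    zip_prefix = record["zip"][:3] + "**"
    return {"age": age_band, "zip": zip_prefix, "disease": record["disease"]}

original = {"age": 34, "zip": "02139", "disease": "flu"}
print(generalize_record(original))  # {'age': '30-39', 'zip': '021**', 'disease': 'flu'}
```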
A completely different identity disclosure attack is possible when a model
built using data mining techniques such as classification or association rules is so
granular (on a specific data set) that it identifies a specific individual. Publishing
such a model alone, even without access to the data from which it has been obtained,
would then disclose the data values that the model represents for that specific
individual. Rule hiding is an approach attempting to solve this problem. For instance,
(Verykios, Elmagarmid et al. 2004) present strategies preventing association rules
with a sensitive attribute in the consequent from being produced by association
rule mining algorithms; these strategies are based on reducing the support and
confidence of rules with such attributes in the consequent. Another approach to
rule hiding is described in (Oliveira, Zaïane et al. 2004). (Atzori, Bonchi et al.
2008) show how such disclosure can be avoided by elegantly generalizing to models
the concept of k-anonymity discussed above for the data.
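As a rough illustration of the support/confidence-reduction idea (not the specific heuristics of the cited papers), the sketch below greedily removes a sensitive consequent item from supporting transactions until the rule can no longer be mined at the given thresholds; the transaction data, item names, and thresholds are invented for the example.

```python
def support(transactions, itemset):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    sup_a = support(transactions, antecedent)
    return support(transactions, set(antecedent) | set(consequent)) / sup_a if sup_a else 0.0

def hide_rule(transactions, antecedent, consequent, min_sup, min_conf):
    """Greedy sketch: drop the sensitive consequent from supporting transactions
    until the rule falls below the mining thresholds."""
    antecedent, consequent = set(antecedent), set(consequent)
    for t in transactions:
        if (support(transactions, antecedent | consequent) < min_sup or
                confidence(transactions, antecedent, consequent) < min_conf):
            break
        if antecedent | consequent <= t:
            t -= consequent  # in-place: remove the sensitive item from this transaction
    return transactions

data = [{"bread", "milk", "hiv_drug"},
        {"bread", "milk", "hiv_drug"},
        {"bread", "milk"},
        {"bread", "eggs"}]
hide_rule(data, {"bread", "milk"}, {"hiv_drug"}, min_sup=0.4, min_conf=0.6)
print(confidence(data, {"bread", "milk"}, {"hiv_drug"}))  # reduced to ~0.33
```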
11.3 Attribute Disclosure
A different set of methods protecting against disclosure of the value of a sensitive
attribute are the perturbative methods. They implement the "camouflage" paradigm.
The seminal work in this area is due to (Agrawal and Srikant 2000). The
main idea is simple: an attribute (say, the j-th column in T) is systematically
changed by adding to each a_ij, i = 1…n, a value obtained from a probability
distribution.
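A minimal sketch of this additive-noise idea, assuming Gaussian noise with an illustrative standard deviation (the choice of distribution, parameters, and function names is not taken from the cited work):

```python
import random

def perturb_column(values, sigma=5.0, seed=0):
    """Add independent Gaussian noise to every value of one numeric attribute."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]

ages = [23, 31, 45, 52, 38]
print(perturb_column(ages))  # individual values are masked; aggregate shape is roughly preserved
```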