Is the data
- centralized: T is owned by one party u_i, and is to be shared with another party (or parties) u_k, e.g. so that u_k can perform a data mining operation on T?
- or distributed: each u_i knows only certain rows (or columns) of T, but all u_i's need the result of a data mining operation performed on the whole T?
In the remainder of this chapter, we will follow these taxonomical dimensions in our
review of the existing PPDM research.
We need to introduce some further definitions useful in the presentation of the
PPDM concepts. In particular, an explicit identifier is an attribute that allows
direct linking of an instance (a row in T) to a person i; e.g., knowing a cellular
phone number or a driver's license number will unambiguously link the row in T
in which this explicit identifier occurs to a person i. A quasi-identifier is a set of
attributes which individually are not explicit identifiers, but which jointly may link
a row in T to a specific person. For instance, (Sweeney 2002) shows that in the
United States the quasi-identifier triplet <date of birth, 5 digit postal code,
gender> uniquely identifies 87% of the population of the country. As a convincing
application of this observation, by combining a public healthcare information
dataset with a publicly available voters' list on such quasi-identifiers, Sweeney
was able to obtain the health records of the governor of Massachusetts from a
published dataset of health records of all state employees in which only the
explicit identifiers had been removed.
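A linkage attack of this kind amounts to joining the de-identified table with a public dataset on the quasi-identifier attributes. The following sketch uses entirely invented names and records; only the attack pattern reflects the text:

```python
# De-identified health table: explicit identifiers removed,
# quasi-identifiers <date of birth, ZIP code, gender> retained.
health = [
    {"dob": "1945-07-31", "zip": "02138", "gender": "M", "diagnosis": "hypertension"},
    {"dob": "1962-01-15", "zip": "02139", "gender": "F", "diagnosis": "asthma"},
]

# Publicly available voter list, with names.
voters = [
    {"name": "A. Smith", "dob": "1945-07-31", "zip": "02138", "gender": "M"},
    {"name": "B. Jones", "dob": "1980-03-02", "zip": "02139", "gender": "F"},
]

QI = ("dob", "zip", "gender")

# Join the two tables on the quasi-identifier.
reidentified = [
    (v["name"], h["diagnosis"])
    for h in health
    for v in voters
    if all(h[a] == v[a] for a in QI)
]
print(reidentified)  # [('A. Smith', 'hypertension')]
```

Removing explicit identifiers alone is thus insufficient: any record whose quasi-identifier combination is unique in the population can be re-identified this way.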
For the sake of completeness, it should be mentioned that there can also be a
so-called "membership" privacy attack: given a table T and an individual i, is i in T?
We can observe that this is a form of an identity disclosure attack, in terms of the
PPDM dimensions proposed above.
11.2 Identity Disclosure
In general, the main PPDM identity protection methods draw on simple ideas
known to humans throughout history and amply presented in literature and film.
These paradigms can be described as “hiding in the crowd” and “camouflage”.
One "hiding in the crowd" approach to data privacy is k-anonymity. The k-anonymity
method (Sweeney 2001; Ciriani, Capitani di Vimercati et al. 2007)
modifies the original data T to obtain T' such that for any quasi-identifier q that
can be built from attributes of T there are at least k instances in T' such that q
matches these instances. Datasets need to be generalized to satisfy k-anonymity.
See Fig. 1 for an example of k-anonymized data. Conceptually, such data
generalizations correspond to clustering of datasets, and to using clusters instead of the
original elements. These clusters can also be viewed as equivalence classes of
the attribute generalization. Clearly, generalizations cause deterioration of the
quality of the data as the original values of at least some attributes are lost.
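The k-anonymity property itself is easy to verify over a generalized table: every combination of quasi-identifier values must occur in at least k rows. A minimal sketch, with invented generalized records:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """Return True if every quasi-identifier value combination
    occurs in at least k rows of the (generalized) table."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(c >= k for c in counts.values())

# T': birth dates generalized to decades, postal codes truncated.
t_prime = [
    {"birth": "196*", "zip": "021**", "gender": "F", "diagnosis": "flu"},
    {"birth": "196*", "zip": "021**", "gender": "F", "diagnosis": "asthma"},
    {"birth": "197*", "zip": "024**", "gender": "M", "diagnosis": "flu"},
    {"birth": "197*", "zip": "024**", "gender": "M", "diagnosis": "cold"},
]

qi = ("birth", "zip", "gender")
print(is_k_anonymous(t_prime, qi, 2))  # True
print(is_k_anonymous(t_prime, qi, 3))  # False
```

Each equivalence class here (the two generalized quasi-identifier groups) contains exactly two rows, so the table is 2-anonymous but not 3-anonymous.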
k-anonymization can therefore be seen as a task of minimal data generalization of