questions awaiting solutions and the forthcoming challenges. For a more technical
and a more complete presentation, the reader may consult (Vaidya, Zhu et al. 2006), or the more recent, in-depth technical tutorials (Fung, Wang et al. 2010; Chen, Kifer et al. 2009).
Data privacy is often seen as an aspect of, or an appendix to, data security. This is not a correct view, as the goals of the two fields are divergent. On the one hand, security protects the data against unauthorized access, e.g. reading the data while it is transmitted across a network. But once the data reaches an authorized recipient, security imposes no additional constraints on revealing personal information about an individual. That, on the other hand, is the goal of data privacy. This divergence of goals is well illustrated by public key cryptography, which protects data encrypted using a person's private key, but also ties the data tightly to the individual whose public key is used to decipher it, thereby identifying that individual. It is therefore correct to describe the relationship between data security and data privacy as the former being a prerequisite of the latter. Data must be protected in storage and transmission by data security methods (e.g. cryptographic techniques), but if data privacy is a goal, then additional steps, some of them described below, must be taken to protect the privacy of the individuals represented in the data.
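The public-key point can be made concrete with a digital signature, the standard mechanism behind "encrypting with a private key": anyone holding the matching public key can verify which key holder produced a record, so the security mechanism itself binds the data to an identity without concealing its content. The following is a minimal sketch using the Python cryptography package with RSA-PSS; the record content is invented for illustration.

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Hypothetical record; the field values are invented for illustration.
record = b"patient=Alice Example; birth_year=1978; diagnosis=asthma"

# The data owner signs the record with her private key.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
signature = private_key.sign(
    record,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Anyone with the matching public key can verify the signature, i.e. attribute the
# record to that specific key holder; verify() raises InvalidSignature otherwise.
private_key.public_key().verify(
    signature,
    record,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)
print("record verifiably linked to the signer's key pair")

The sketch shows the tension described above: the signature adds security (integrity, authenticity) while simultaneously strengthening the link between the data and a specific individual.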
Before reviewing current work in PPDM, we need to establish dimensions that will structure this review. In order to identify those dimensions, we need to ground the discussion in the process that PPDM addresses, namely sharing data and the results of a data mining operation between users u_1, …, u_m, m ≥ 2. Furthermore, it is useful to view the data as a database of n records, each consisting of l fields, where each record represents an individual i_i and describes i_i in terms of its fields. The usual simplified representation is a table T, in which rows represent individuals i_1, …, i_n, and columns - referred to as attributes - represent the fields a_1, …, a_l. This assumes a fixed representation, i.e. each individual is represented by a vector of values of a_1, …, a_l.
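As a point of reference, the following minimal Python sketch shows this fixed representation of T; the attribute names and values are invented for illustration only.

# Table T with n rows (individuals i_1, ..., i_n) and l columns (attributes a_1, ..., a_l).
attributes = ["zip", "birth_year", "sex", "diagnosis"]   # a_1, ..., a_l
T = [
    ["45123", 1978, "F", "asthma"],    # record of individual i_1
    ["45127", 1982, "M", "diabetes"],  # record of individual i_2
    ["45123", 1978, "F", "flu"],       # record of individual i_3
]

n, l = len(T), len(attributes)                       # n records, l attributes
a_ij = T[0][attributes.index("diagnosis")]           # value a_ij of attribute a_j for i_i
print(n, l, a_ij)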
For a holistic view of PPDM, the first useful dimension is to consider privacy in terms of what is being protected, or conversely - what an attacker wants to obtain from T. The second useful dimension is the ownership structure of the data - does it belong to one entity and have to be shared with another entity (m = 2), or is it built from parts owned by different entities? We therefore propose to consider the following dimensions:
What is being protected:
o the data: an attacker, given T,
  - will not be able to link any row in T to a specific individual i_i [identity disclosure]
  - will not be able to obtain the value a_ij of a sensitive attribute a_j of i_i [attribute disclosure]
  (both forms of disclosure are illustrated by the sketch following this list)
o the inferred data mining result: an attacker, not knowing T but given the results of the data mining operation, e.g. an association rule learned from T, will not be able to identify attributes of a specific i_i [model-based identity disclosure]
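The following sketch, again with invented data, shows how the first two kinds of disclosure arise in practice: joining the released table T with a public register on a few quasi-identifying attributes re-attaches a name to a row of T (identity disclosure) and, with it, exposes the value of a sensitive attribute (attribute disclosure).

import pandas as pd

# Released table T with direct identifiers removed; values are invented.
T = pd.DataFrame([
    {"zip": "45123", "birth_year": 1978, "sex": "F", "diagnosis": "asthma"},
    {"zip": "45127", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
])

# A public register the attacker may hold (e.g. a voter list); also invented.
register = pd.DataFrame([
    {"name": "Alice Example", "zip": "45123", "birth_year": 1978, "sex": "F"},
])

# Joining on the quasi-identifiers links a named person to a row of T
# (identity disclosure) and reveals her sensitive attribute (attribute disclosure).
linked = register.merge(T, on=["zip", "birth_year", "sex"])
print(linked[["name", "diagnosis"]])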