Privacy-Preserving Data Mining: A Survey - Database Security: Applications and Trends

3 The k-Anonymity Framework

The randomization method is a simple technique that can be easily implemented
at data-collection time, because the noise added to a given record is
independent of the behavior of other data records. This is also a weakness,
because outlier records can often be difficult to mask. Clearly, in cases in
which privacy preservation does not need to be performed at data-collection
time, it is desirable to have a technique in which the level of inaccuracy
depends upon the behavior of the locality of the given record. Another key
weakness of the randomization framework is that it does not consider the
possibility that publicly available records can be used to identify the owners
of the records. In [10], it has been shown that the use of publicly available
records can heavily compromise privacy in high-dimensional cases. This is
especially true of outlier records, which can be easily distinguished from
other records in their locality.
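
As a rough illustration of why outliers are hard to mask, the sketch below
perturbs each value with independent noise. The Gaussian noise model, the
function name randomize, and the sample ages are assumptions made for this
example only; the text does not prescribe a particular noise distribution.

```python
import random

def randomize(values, sigma, seed=0):
    """Add independent zero-mean Gaussian noise to every value.

    Each record is perturbed without looking at any other record,
    which is what makes the method usable at data-collection time.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [v + rng.gauss(0.0, sigma) for v in values]

# Hypothetical ages: five records in a dense locality and one outlier.
ages = [31, 33, 34, 35, 36, 98]
noisy = randomize(ages, sigma=5.0)

# The noise scale is the same for every record, so the outlier stays
# far away from the rest and remains easy to single out.
outlier_position = noisy.index(max(noisy))
```

Here the same noise level is applied to the dense records and to the outlier,
so the perturbed outlier is still clearly separated from its neighbors.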

In many applications, the data records are made available by simply removing
key identifiers such as names and social-security numbers from personal
records. However, other kinds of attributes (known as pseudo-identifiers or
quasi-identifiers) can be used to accurately identify the records. For example,
attributes such as age, zip-code and sex are available in public records such
as census rolls. When these attributes are also available in a given data set,
they can be used to infer the identity of the corresponding individual. A
combination of these attributes can be very powerful, since it can be used to
narrow down the possibilities to a small number of individuals.
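
The linkage described above can be sketched as a join on the pseudo-identifier
attributes. All names, records, and the helper link below are hypothetical and
exist only to illustrate the idea; no real data set is implied.

```python
# Hypothetical released table: direct identifiers removed,
# pseudo-identifiers (age, zip, sex) still present.
released = [
    {"age": 34, "zip": "10027", "sex": "F", "diagnosis": "asthma"},
    {"age": 34, "zip": "10027", "sex": "M", "diagnosis": "diabetes"},
    {"age": 51, "zip": "10463", "sex": "F", "diagnosis": "flu"},
]

# Hypothetical public roll containing names and the same attributes.
public = [
    {"name": "Alice", "age": 34, "zip": "10027", "sex": "F"},
    {"name": "Bob", "age": 34, "zip": "10027", "sex": "M"},
    {"name": "Carol", "age": 51, "zip": "10463", "sex": "F"},
    {"name": "Dana", "age": 51, "zip": "10463", "sex": "F"},
]

PSEUDO_IDS = ("age", "zip", "sex")

def link(released, public, keys=PSEUDO_IDS):
    """Pair each released record with the public names sharing its key values."""
    index = {}
    for person in public:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])
    return [(rec, index.get(tuple(rec[k] for k in keys), [])) for rec in released]

matches = link(released, public)

# A record is re-identified when exactly one public individual matches it.
reidentified = [
    (names[0], rec["diagnosis"]) for rec, names in matches if len(names) == 1
]
```

In this toy data the first two released records each match a single person,
while the third matches two people and therefore stays ambiguous.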

In k-anonymity techniques [98], we reduce the granularity of representation of
these pseudo-identifiers with the use of techniques such as generalization and
suppression. In the method of generalization, the attribute values are
generalized to a range in order to reduce the granularity of representation.
For example, the date of birth could be generalized to a range such as year of
birth, so as to reduce the risk of identification. In the method of
suppression, the value of the attribute is removed completely. It is clear that
such methods reduce the risk of identification through public records, at the
cost of reduced accuracy for applications that use the transformed data.
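
A minimal sketch of the two operations follows. The date-of-birth example comes
from the text; the zip-code prefix masking, the "*" placeholder, and all
function names are assumptions chosen for illustration.

```python
from datetime import date

def generalize_birth_date(d):
    """Generalization: replace a full date of birth with the year of birth."""
    return d.year

def generalize_zip(zip_code, keep=3):
    """Generalization (assumed scheme): keep a prefix and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def suppress(_value):
    """Suppression: drop the value entirely, leaving only a placeholder."""
    return "*"

# Hypothetical record with three pseudo-identifiers.
record = {"birth_date": date(1975, 3, 14), "zip": "10027", "sex": "F"}

transformed = {
    "birth_year": generalize_birth_date(record["birth_date"]),
    "zip": generalize_zip(record["zip"]),
    "sex": suppress(record["sex"]),
}
```

Each transformation makes the record match more individuals in a public roll,
and each also discards detail that a downstream application might have used.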

In order to reduce the risk of identification, the k-anonymity approach
requires that every tuple in the table be indistinguishably related to no
fewer than k respondents. This can be formalized as follows:

Definition 1. Each release of the data must be such that every combination of
values of quasi-identifiers can be indistinguishably matched to at least k
respondents.
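
One common operational reading of Definition 1 is that every combination of
quasi-identifier values occurring in the released table occurs at least k
times. The sketch below checks that condition; it is an assumed simplification
(it counts rows in the release rather than external respondents), and the
function name and sample rows are hypothetical.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination appears >= k times."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(c >= k for c in counts.values())

# Hypothetical release after generalization.
rows = [
    {"birth_year": 1975, "zip": "100**", "sex": "F"},
    {"birth_year": 1975, "zip": "100**", "sex": "F"},
    {"birth_year": 1980, "zip": "104**", "sex": "M"},
    {"birth_year": 1980, "zip": "104**", "sex": "M"},
    {"birth_year": 1980, "zip": "104**", "sex": "M"},
]
QI = ("birth_year", "zip", "sex")

two_anonymous = is_k_anonymous(rows, QI, k=2)
three_anonymous = is_k_anonymous(rows, QI, k=3)
```

The smallest group here has two rows, so the table passes the check for k = 2
and fails it for k = 3.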

The first algorithm for k-anonymity was proposed in [98]. The approach uses
domain generalization hierarchies of the quasi-identifiers in order to build
k-anonymous tables. The concept of k-minimal generalization has been proposed
in [98] in order to limit the level of generalization, so as to maintain as
much data precision as possible for a given level of anonymity. Subsequently,
the topic of
