Privacy in Database Publishing: A Bayesian Perspective - Database Security: Applications and Trends


4 Generalization-Based Publishing

The concept of anonymization by generalization [23, 24] was introduced to

enable the publishing of data about individuals for the purpose of studies

(e.g. computing statistics and data mining), while making it hard to pinpoint

the exact individual associated with each data value. A canonical example

pertains to a hospital that publishes seemingly anonymized data by releasing

the age, gender and zip code of its patients together with the disease, in the

hope that by leaving out the name and social security number attackers cannot

infer who suffers from what disease.

Sweeney shows that this hope is unfounded [24], as over 85% of the US

population is uniquely identified by the combination of age, gender and zip. This data

is accessible to attackers either because they know the person, or simply from

publicly available databases such as voter registration lists. In a notorious

illustration of her point, Sweeney uncovered the medical history of a former

governor of Massachusetts by combining the medical data with the registration

list.
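The identifiability claim can be made concrete with a small calculation. The sketch below is an illustration only and not Sweeney's methodology: it counts the fraction of records whose (age, gender, zip) combination occurs exactly once. The function name `fraction_unique`, the field names, and the toy population are hypothetical.

```python
from collections import Counter

def fraction_unique(records, keys=("age", "gender", "zip")):
    """Fraction of records whose combination of `keys` values occurs exactly once."""
    if not records:
        return 0.0
    combos = [tuple(rec[k] for k in keys) for rec in records]
    counts = Counter(combos)
    unique = sum(1 for combo in combos if counts[combo] == 1)
    return unique / len(records)

# Toy population, made up for illustration (not real data).
population = [
    {"age": 34, "gender": "F", "zip": "02139"},
    {"age": 34, "gender": "F", "zip": "02139"},
    {"age": 51, "gender": "M", "zip": "02139"},
    {"age": 29, "gender": "F", "zip": "02142"},
]

print(fraction_unique(population))  # 0.5: two of the four records are unique
```

In this toy population the two 34-year-old women in the same zip code hide each other, while the other two records are pinpointed by their attribute combination alone.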

The attacks based on combining the anonymized data with external public

databases are called linking attacks. Sweeney argues that in order to defend

against linking attacks, the data owner must conservatively assume that the

attacker has access to the public database, and that the information in this

database uniquely identifies the individual. The upshot of this assumption is

that the attacker has access to the identity of each individual, as if the owner

had published it. Therefore, the best a defense against linking attacks can

accomplish is to hide the association between the individual's identity and

the sensitive data (such as her disease, salary, etc.).
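A linking attack amounts to a join between the published table and a public table on the shared attributes. The following is a minimal sketch under stated assumptions: the function `linking_attack`, the field names (`name`, `age`, `gender`, `zip`, `disease`), and all records are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

def linking_attack(published, public, qi=("age", "gender", "zip"),
                   sensitive="disease"):
    """Join published rows to public rows on the shared attributes `qi`.

    Returns (name, sensitive value) pairs for every published row whose
    attribute combination matches exactly one person in the public table.
    """
    names_by_qi = defaultdict(list)
    for person in public:
        names_by_qi[tuple(person[a] for a in qi)].append(person["name"])

    reidentified = []
    for row in published:
        candidates = names_by_qi.get(tuple(row[a] for a in qi), [])
        if len(candidates) == 1:  # unique match: the identity is exposed
            reidentified.append((candidates[0], row[sensitive]))
    return reidentified

# Made-up data: a "hospital" release without names, and a "voter list" with names.
hospital = [
    {"age": 34, "gender": "F", "zip": "02139", "disease": "flu"},
    {"age": 51, "gender": "M", "zip": "02142", "disease": "asthma"},
]
voters = [
    {"name": "Alice", "age": 34, "gender": "F", "zip": "02139"},
    {"name": "Bob", "age": 51, "gender": "M", "zip": "02142"},
    {"name": "Dan", "age": 51, "gender": "M", "zip": "02142"},
]

print(linking_attack(hospital, voters))  # [('Alice', 'flu')]
```

Alice is re-identified because her attribute combination is unique in the public table. Bob and Dan share a combination, so the join alone does not tell the attacker which of them the second hospital row belongs to.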

In detail, work on anonymization by generalization considers a database

containing a single relation R(ID, QI, S), where

• the list of attributes ID comprises the person's identifier (e.g. (ssn) or (first name, middle name, last name)),

• the list of attributes QI gives the person's quasi-identifier (e.g. (age, gender, zip)), which can be used to look up the actual identifier in some public database of schema (ID, QI), and

• S is the list of sensitive attributes (e.g. disease, salary, etc.).

Association between identity and sensitive attributes. We say that

identity id is associated in R to sensitive attribute value s if there exists some

tuple r ∈ R with r[ID] = id and r[S] = s.

Generalization function. To keep associations private, the owner anonymizes the QI attributes using a generalization function g. The function g hides the actual values of the QI attributes, replacing them with more general values. For instance, an age value is replaced by an age interval, and a zip code is generalized by dropping some of its least significant digits. In the extreme, the generalization function can hide the attribute value completely by replacing it with the wildcard “*”. This is called attribute suppression.
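The two definitions above can be sketched in a few lines of Python. This is an illustrative sketch, not the chapter's own formulation: the dictionary layout of R, the helper names, the 10-year age buckets, the choice to keep three zip digits, and the decision to suppress gender are all assumptions made for the example.

```python
def associated(R, ident, s):
    """True if some tuple r in R has r[ID] = ident and r[S] = s."""
    return any(r["ID"] == ident and r["S"] == s for r in R)

def generalize_age(age, width=10):
    """Replace an age by the interval of size `width` that contains it."""
    low = (age // width) * width
    return f"[{low}-{low + width - 1}]"

def generalize_zip(zip_code, keep=3):
    """Drop the least significant digits, keeping the first `keep`."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def suppress(_value):
    """Attribute suppression: hide the value completely."""
    return "*"

def g(qi):
    """One possible generalization function over QI = (age, gender, zip)."""
    age, gender, zip_code = qi
    return (generalize_age(age), suppress(gender), generalize_zip(zip_code))

# Made-up relation R(ID, QI, S) with a single tuple.
R = [{"ID": "123-45-6789", "QI": (34, "F", "02139"), "S": "flu"}]

print(associated(R, "123-45-6789", "flu"))  # True
print(g(R[0]["QI"]))                        # ('[30-39]', '*', '021**')
```

Wider age buckets or fewer retained zip digits make g more general, so that more individuals share the same generalized QI values, at the cost of less precise published data.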
