4 Generalization-Based Publishing
The concept of anonymization by generalization [23, 24] was introduced to
enable the publishing of data about individuals for the purpose of studies
(e.g. computing statistics and data mining), while making it hard to pinpoint
the exact individual associated with each data value. A canonical example
pertains to a hospital that publishes seemingly anonymized data by releasing
the age, gender and zip code of its patients together with the disease, in the
hope that by leaving out the name and social security number attackers cannot
infer who suffers from what disease.
Sweeney shows that this hope is unfounded [24], as over 85% of the US
population is uniquely identified by the combination of age, gender and zip code. This data
is accessible to attackers either because they know the person, or simply from
publicly available databases such as voter registration lists. In a notorious
illustration of her point, Sweeney uncovered the medical history of a former
governor of Massachusetts by combining the medical data with the registration
list.
The attacks based on combining the anonymized data with external public
databases are called linking attacks. Sweeney argues that in order to defend
against linking attacks, the data owner must conservatively assume that the
attacker has access to the public database, and that the information in this
database uniquely identifies the individual. The upshot of this assumption is
that the attacker has access to the identity of each individual, as if the owner
had published it. Therefore, the best a defense against linking attacks can
accomplish is to hide the association between the individual's identity and
the sensitive data (such as her disease, salary, etc.).
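
A linking attack of this kind amounts to a join on the quasi-identifying attributes. The following is a minimal sketch, not taken from [24]: the two tables, the names in them, and the function linking_attack are all hypothetical, made up purely for illustration.

```python
# Hypothetical published table: identifiers removed, age/gender/zip kept.
published = [
    {"age": 34, "gender": "F", "zip": "02138", "disease": "flu"},
    {"age": 58, "gender": "M", "zip": "02139", "disease": "diabetes"},
]

# Hypothetical public database (e.g. a voter registration list).
voters = [
    {"name": "Alice Smith", "age": 34, "gender": "F", "zip": "02138"},
    {"name": "Bob Jones", "age": 58, "gender": "M", "zip": "02139"},
    {"name": "Carol White", "age": 41, "gender": "F", "zip": "02140"},
]

QI = ("age", "gender", "zip")


def linking_attack(published, public, qi=QI):
    """Join both tables on the attributes in qi and return (name, disease)
    pairs for every published row whose qi value matches exactly one person."""
    names_by_qi = {}
    for person in public:
        key = tuple(person[a] for a in qi)
        names_by_qi.setdefault(key, []).append(person["name"])

    links = []
    for row in published:
        names = names_by_qi.get(tuple(row[a] for a in qi), [])
        if len(names) == 1:  # the combination pinpoints a single individual
            links.append((names[0], row["disease"]))
    return links


print(linking_attack(published, voters))
```

In this toy data every published row matches exactly one voter, so leaving out the name protects nobody. This is the conservative assumption the defender is asked to make.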
In detail, work on anonymization by generalization considers a database
containing a single relation R(ID, QI, S), where
- the list of attributes ID comprises the person's identifier
  (e.g. (ssn) or (first name, middle name, last name)),
- the list of attributes QI gives the person's quasi-identifier
  (e.g. (age, gender, zip)), which can be used to look up the actual identifier
  in some public database of schema ID, QI, and
- S is the list of sensitive attributes (e.g. disease, salary, etc.).
Association between identity and sensitive attributes. We say that
identity id is associated in R to sensitive attribute value s if there exists some
tuple r ∈ R with r[ID] = id and r[S] = s.
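
The definition can be read as a simple existence check over the tuples of R. The sketch below assumes, for illustration only, that ID is a single ssn attribute and S a single disease attribute; the type Row, the sample tuples and the function is_associated are hypothetical names, not part of the original formalism.

```python
from typing import NamedTuple


class Row(NamedTuple):
    ident: str   # ID attributes (assumed here: a single ssn)
    qi: tuple    # QI attributes, e.g. (age, gender, zip)
    s: str       # S attributes (assumed here: a single disease)


# Hypothetical instance of the relation R.
R = [
    Row("111-11-1111", (34, "F", "02138"), "flu"),
    Row("222-22-2222", (58, "M", "02139"), "diabetes"),
]


def is_associated(R, id_value, s_value):
    """True iff some tuple r in R has r[ID] = id_value and r[S] = s_value."""
    return any(r.ident == id_value and r.s == s_value for r in R)


print(is_associated(R, "111-11-1111", "flu"))
print(is_associated(R, "111-11-1111", "diabetes"))
```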
Generalization function. To keep associations private, the owner anonymizes
the QI attributes using a generalization function g, which hides the actual
values of the QI attributes by replacing them with more general values. For
instance, an age value is replaced by an age interval, and a zip code is
generalized by dropping some of its least significant digits. In the extreme, the
generalization function can hide the attribute value completely by replacing it
with the wildcard “*”. This is called attribute suppression.
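
To make the three kinds of generalization concrete, here is a minimal sketch of one possible g over a quasi-identifier (age, gender, zip). The interval width, the number of dropped digits, the choice to suppress gender, and all function names are assumptions made for illustration. Dropped zip digits are shown as "*" so that the generalized code keeps its length.

```python
def generalize_age(age, width=10):
    """Replace an exact age by the interval of the given width containing it."""
    low = (age // width) * width
    return f"[{low}-{low + width - 1}]"


def generalize_zip(zip_code, dropped=2):
    """Hide the `dropped` least significant digits of a zip code."""
    dropped = min(dropped, len(zip_code))
    return zip_code[: len(zip_code) - dropped] + "*" * dropped


def suppress(_value):
    """Attribute suppression: hide the value completely."""
    return "*"


def g(qi):
    """One possible generalization function over (age, gender, zip)."""
    age, gender, zip_code = qi
    return (generalize_age(age), suppress(gender), generalize_zip(zip_code))


print(g((34, "F", "02138")))
```

Under these settings, all patients in their thirties whose zip code starts with 021 map to the same generalized value, so the published QI no longer singles out one of them.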