Is the data
- centralized: T is owned by one party u_i, and is to be shared with another party (or parties) u_k, e.g. so that u_k can perform a data mining operation on T?
- or distributed: each u_i knows only certain rows (or columns) of T, but all u_i's need the result of a data mining operation performed on the whole T?
In the remainder of this chapter, we will follow these taxonomical dimensions in our
review of the existing PPDM research.
We need to introduce some further definitions useful in the presentation of the
PPDM concepts. In particular, an explicit identifier is an attribute that allows
direct linking of an instance (a row in T) to a person i; e.g., knowing a cellular
phone number or a driver's license number will unambiguously link the row in T
in which this explicit identifier occurs to a person i. A quasi-identifier is a set of
attributes which individually are not explicit identifiers, but which jointly may link
a row in T to a specific person. For instance, (Sweeney 2002) shows that in the
United States the quasi-identifier triplet <date of birth, 5 digit postal code,
gender> uniquely identifies 87% of the population of the country. As a convincing
application of this observation, by combining a public healthcare information
dataset with a publicly available voters' list on such quasi-identifiers, Sweeney
was able to obtain the health records of the governor of Massachusetts from a
published dataset of health records of all state employees in which only the
explicit identifiers had been removed.
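A linkage attack of this kind amounts to joining the de-identified table with a public dataset on the quasi-identifier attributes. The following sketch uses entirely invented names and records; only the attack pattern reflects the text:

```python
# De-identified health table: explicit identifiers removed,
# quasi-identifiers <date of birth, ZIP code, gender> retained.
health = [
    {"dob": "1945-07-31", "zip": "02138", "gender": "M", "diagnosis": "hypertension"},
    {"dob": "1962-01-15", "zip": "02139", "gender": "F", "diagnosis": "asthma"},
]

# Publicly available voter list, with names.
voters = [
    {"name": "A. Smith", "dob": "1945-07-31", "zip": "02138", "gender": "M"},
    {"name": "B. Jones", "dob": "1980-03-02", "zip": "02139", "gender": "F"},
]

QI = ("dob", "zip", "gender")

# Join the two tables on the quasi-identifier.
reidentified = [
    (v["name"], h["diagnosis"])
    for h in health
    for v in voters
    if all(h[a] == v[a] for a in QI)
]
print(reidentified)  # [('A. Smith', 'hypertension')]
```

Removing explicit identifiers alone is thus insufficient: any record whose quasi-identifier combination is unique in the population can be re-identified this way.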
For the sake of completeness, it should be mentioned that there can also be a
so-called "membership" privacy attack: given a table T and an individual i, is i in T?
We can observe that this is a form of an identity disclosure attack, in terms of the
PPDM dimensions proposed above.
11.2 Identity Disclosure
In general, the main PPDM identity protection methods draw on simple ideas
known to humans throughout history and amply presented in literature and film.
These paradigms can be described as “hiding in the crowd” and “camouflage”.
One "hiding in the crowd" approach to data privacy is k-anonymity. The k-anonymity
method (Sweeney 2001; Ciriani, Capitani di Vimercati et al. 2007)
modifies the original data T to obtain T' such that for any quasi-identifier q that
can be built from attributes of T there are at least k instances in T' such that q
matches these instances. Datasets need to be generalized to satisfy k-anonymity.
See Fig. 1 for an example of k-anonymized data. Conceptually, such data
generalizations correspond to clustering of datasets, and to using clusters instead of the
original elements. These clusters can also be viewed as equivalence classes of
the attribute generalization. Clearly, generalizations cause deterioration of the
quality of the data as the original values of at least some attributes are lost.
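The k-anonymity property itself is easy to verify over a generalized table: every combination of quasi-identifier values must occur in at least k rows. A minimal sketch, with invented generalized records:

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifier, k):
    """Return True if every quasi-identifier value combination
    occurs in at least k rows of the (generalized) table."""
    counts = Counter(tuple(row[a] for a in quasi_identifier) for row in rows)
    return all(c >= k for c in counts.values())

# T': birth dates generalized to decades, postal codes truncated.
t_prime = [
    {"birth": "196*", "zip": "021**", "gender": "F", "diagnosis": "flu"},
    {"birth": "196*", "zip": "021**", "gender": "F", "diagnosis": "asthma"},
    {"birth": "197*", "zip": "024**", "gender": "M", "diagnosis": "flu"},
    {"birth": "197*", "zip": "024**", "gender": "M", "diagnosis": "cold"},
]

qi = ("birth", "zip", "gender")
print(is_k_anonymous(t_prime, qi, 2))  # True
print(is_k_anonymous(t_prime, qi, 3))  # False
```

Each equivalence class here (the two generalized quasi-identifier groups) contains exactly two rows, so the table is 2-anonymous but not 3-anonymous.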
k-anonymization can therefore be seen as a task of minimal data generalization of