Table 1. Different perturbation-based techniques
Data swapping: Data is exchanged between different records; individual information is thus protected while calculated statistics are not affected.
Random sample queries: The answer to a specific query is created dynamically from a random subset instead of all data items. This approach works well only for large datasets.
Fixed perturbation: Data is modified, not swapped, as soon as it is loaded into the data warehouse.
Query-based perturbation: Data is modified dynamically for each query. The advantage is that the accuracy can be varied individually depending on the user's trustworthiness.
mation, Denning (1979) as well as Denning and
Schlorer (1983) show how general trackers can be
used without in-depth background knowledge.
A very simple example illustrates how inference
causes information leakage. If it is known
that Alice is the oldest person but her age is
unknown, repeatedly asking “How many people
are older than X years?” with different values of
X until the database returns the value 1 allows
inference of Alice's age. Enforcing that each
query returns aggregated data of more than one
row does not solve the problem. Repeatedly
querying “How many employees are older than X?”
until the system rejects the query because it
would return fewer than N rows identifies a
minimum set. This set includes N+1 employees,
including Alice, who are older than X; let X = 66
at this point. Subsequently, the query “Retrieve
the sum of ages of all employees who are older
than X” returns a result R1. The last query,
“Retrieve the sum of ages of all employees who
are not called Alice and are older than X,”
returns R2. Finally, subtracting R2 from R1 yields
Alice's age. The example includes a condition,
“not called Alice,” that excludes a single item.
If the “not equal” operation were not allowed, a
binary search could still be used to exclude a
single item with a comparison operator. Simple
control of result sizes as described here is not
designed to prevent such an exclusion.
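
The attack described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RestrictedDB` class, its `min_rows` threshold (standing in for N), the sample employees, and the `infer_age` helper are all hypothetical names invented here, not part of any real database system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Employee:
    name: str
    age: int


class RestrictedDB:
    """Hypothetical statistical database: aggregate queries are answered
    only if at least `min_rows` rows match (simple result-size control)."""

    def __init__(self, rows, min_rows):
        self._rows = list(rows)
        self._min_rows = min_rows

    def _select(self, predicate):
        matched = [r for r in self._rows if predicate(r)]
        if len(matched) < self._min_rows:
            raise PermissionError("query set too small")
        return matched

    def count(self, predicate):
        return len(self._select(predicate))

    def sum_age(self, predicate):
        return sum(r.age for r in self._select(predicate))


def infer_age(db, target_name, max_age=120):
    """Recover the target's age using only permitted aggregate queries."""
    # Step 1: raise X until the system refuses to answer.
    accepted = []
    for x in range(max_age):
        try:
            db.count(lambda r, x=x: r.age > x)
        except PermissionError:
            break
        accepted.append(x)
    # Step 2: starting from the most restrictive accepted X, find one for
    # which both sum queries are still answered, then subtract R2 from R1.
    for x in reversed(accepted):
        try:
            r1 = db.sum_age(lambda r, x=x: r.age > x)
            r2 = db.sum_age(
                lambda r, x=x: r.age > x and r.name != target_name
            )
        except PermissionError:
            continue
        return r1 - r2
    raise RuntimeError("no usable query set found")


# Made-up sample data; Alice is the oldest employee.
EMPLOYEES = [
    Employee("Alice", 71), Employee("Bob", 68), Employee("Carol", 67),
    Employee("Dave", 50), Employee("Eve", 45), Employee("Frank", 30),
]
```

With `min_rows=2`, the sketch settles on X = 66, where three employees (N+1) are older than X, so the query that excludes Alice still matches N rows and is answered.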
In audit-based expanded query set size control,
also known as Nabil and Worthmann's (1989) “query
set overlap control,” the system decides whether to
grant access on the basis of an “assumed
information base,” which is the history of all the
requests issued by the user. The assumed
information base contains all possible inferences
that can be generated with the results of all
previously issued queries; before answering a new
query, the system has to decide whether the query
could be combined with the assumed information
base to infer confidential information.
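
A heavily simplified sketch of such an audit follows. It does not compute all possible inferences; as a stand-in, it only refuses a query whose set of matching rows overlaps too strongly with any previously answered query. The class name `OverlapAuditor` and the parameters `min_size` and `max_overlap` are assumptions made for this illustration.

```python
class OverlapAuditor:
    """Hypothetical per-user auditor that keeps the history of answered
    query sets and refuses queries that overlap too much with them."""

    def __init__(self, min_size, max_overlap):
        self.min_size = min_size        # smallest permitted query set
        self.max_overlap = max_overlap  # largest permitted overlap
        self.history = []               # row-id sets already answered

    def allow(self, row_ids):
        query_set = frozenset(row_ids)
        if len(query_set) < self.min_size:
            return False
        # Simplification: pairwise overlap only, not full inference.
        for previous in self.history:
            if len(query_set & previous) > self.max_overlap:
                return False
        self.history.append(query_set)
        return True
```

Under these assumptions, the two sum queries from the Alice example would touch row sets that share N rows, so the second one is refused whenever `max_overlap` is smaller than N. The cost is that the full history has to be stored and checked for every user.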
Perturbation-based techniques (cf. Table 1)
are characterized by modifying the data so that
the privacy of individuals can still be
guaranteed even if more detailed data is returned
than in restriction-based techniques. Data can be
modified in the original data or in the results
returned.
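
Two of the techniques from Table 1 can be sketched as follows. This is a minimal illustration, not a production method: the function names, the `trust` scale from 0 to 1, the `max_noise` bound, and the sample records are all assumptions introduced here. The first function modifies the stored data (data swapping), the second modifies the returned result (query-based perturbation).

```python
import random
import statistics


def swap_column(records, column, rng):
    """Data swapping: shuffle one column across records. The multiset of
    values, and hence single-column statistics such as the sum or mean,
    stays the same, but a value no longer belongs to its original record."""
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]


def perturbed_mean(records, column, trust, rng, max_noise=5.0):
    """Query-based perturbation: add noise to the result of one query.
    The noise bound shrinks as the (assumed) trust level rises to 1."""
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must lie between 0 and 1")
    true_mean = statistics.fmean([r[column] for r in records])
    scale = max_noise * (1.0 - trust)
    return true_mean + rng.uniform(-scale, scale)


# Made-up sample data.
RECORDS = [
    {"name": "Alice", "salary": 5200},
    {"name": "Bob", "salary": 4100},
    {"name": "Carol", "salary": 6300},
    {"name": "Dave", "salary": 3900},
]
```

Passing a seeded `random.Random` instance makes the sketch reproducible; a real deployment would have to choose its noise distribution and scale far more carefully.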
According to Samarati and Sweeney (1998),
k-anonymity refers to a concept that guarantees
that the data of an individual remains
indistinguishable from that of at least k-1 others.
The basic idea for protecting privacy centers on
quasi-identifiers. Quasi-identifiers are usually a
combination of data items that is likely to allow
identification of a person, such as birth date,
ZIP code, and gender. The idea, explained by
Sweeney (2002), is that the data provider knows
which data is externally available, for example a
list of people with their names, birth dates, ZIP
codes, and genders. High dimensionality, however,
may cause problems. Charu (2005) points out that
once the number of dimensions increases to about
20, even 2-anonymity cannot be preserved in most
cases without losing too much original information.
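
The definition can be checked mechanically: group the records by their quasi-identifier values and take the size of the smallest group. The sketch below assumes records are plain dictionaries; the function names, field names, and sample values are hypothetical.

```python
from collections import Counter


def anonymity_level(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same quasi-identifier values (0 for an empty table, by convention)."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return min(groups.values()) if groups else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """Each record must be indistinguishable from at least k-1 others."""
    return anonymity_level(records, quasi_identifiers) >= k


# Made-up sample data with a full and a generalized ZIP code.
PATIENTS = [
    {"birth_year": 1970, "zip": "02138", "zip_prefix": "021**", "gender": "F"},
    {"birth_year": 1970, "zip": "02139", "zip_prefix": "021**", "gender": "F"},
    {"birth_year": 1985, "zip": "02141", "zip_prefix": "021**", "gender": "M"},
    {"birth_year": 1985, "zip": "02141", "zip_prefix": "021**", "gender": "M"},
]
```

With the generalized `zip_prefix`, the sample table is 2-anonymous; with the full `zip` it is not, because the first two records become distinguishable. The same effect grows with every additional quasi-identifier, which is the dimensionality problem noted above.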
K-anonymity can be attacked using the
homogeneity attack or the background knowledge