Privacy-Preserving Data Mining: A Survey - Database Security: Applications and Trends

technique in [7] analyzes the k-anonymity method in the presence of increasing
dimensionality. The curse of dimensionality becomes especially important when
adversaries may have considerable background information, as a result of which
the boundary between pseudo-identifiers and sensitive attributes may become
blurred. This is generally true, since adversaries may be familiar with the
subject of interest and may have more information about them than what is
publicly available. This is also the motivation for techniques such as
l-diversity [77], which guard against the use of background knowledge to mount
further privacy attacks. The work in [7] concludes that in order to maintain
privacy, a large number of the attributes may need to be suppressed. Thus, the
data loses its utility for data mining algorithms. The broad intuition behind
the result in [7] is that when attributes are generalized into wide ranges, the
combination of a large number of generalized attributes is so sparsely
populated that even 2-anonymity becomes increasingly unlikely. While the method
of l-diversity has not been formally analyzed, some observations made in [77]
seem to suggest that the method becomes increasingly infeasible to implement
effectively with increasing dimensionality.
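
The sparsity intuition can be illustrated with a small simulation. The sketch
below is not the analysis of [7]; it assumes synthetic records whose attributes
are independent and uniformly distributed, each already generalized into a
small number of wide ranges, and the function name and parameters are
hypothetical. It measures the fraction of records that share their full
combination of generalized values with at least k - 1 other records.

```python
import random
from collections import Counter


def fraction_k_anonymous(n_records, n_dims, n_ranges, k=2, seed=0):
    """Fraction of records whose generalized tuple occurs at least k times.

    Assumption: every attribute is independent and uniform over n_ranges
    generalized ranges. Real data is correlated, so this is only a sketch.
    """
    rng = random.Random(seed)
    # Each record is a tuple of range indices, one per generalized attribute.
    records = [
        tuple(rng.randrange(n_ranges) for _ in range(n_dims))
        for _ in range(n_records)
    ]
    counts = Counter(records)
    # A record is k-anonymous here if its tuple is shared by >= k records.
    return sum(1 for r in records if counts[r] >= k) / n_records


if __name__ == "__main__":
    # Even with only two ranges per attribute, the share of 2-anonymous
    # records falls as the number of attributes grows.
    for d in (2, 5, 10, 20, 40):
        print(d, round(fraction_k_anonymous(2000, d, n_ranges=2), 3))
```

Under these assumptions the number of possible combinations grows as
n_ranges raised to the power n_dims, while the number of records stays fixed,
so most combinations end up holding at most one record.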

The method of randomization has also been analyzed in [10]. This paper makes a
first analysis of the ability to re-identify data records with the use of
maximum likelihood estimates. Consider a d-dimensional record
$X = (x_1 \ldots x_d)$, which is perturbed to $Z = (z_1 \ldots z_d)$. For a
given public record $W = (w_1 \ldots w_d)$, we would like to find the
probability that it could
have been perturbed to $Z$ using the perturbing distribution $f_Y(y)$. If this
were true, then the set of values given by
$(Z - W) = (z_1 - w_1 \ldots z_d - w_d)$ should be all drawn from the
distribution $f_Y(y)$. The corresponding log-likelihood fit is given by
$-\sum_{i=1}^{d} \log(f_Y(z_i - w_i))$. The higher the log-likelihood fit, the greater
the probability that the record $W$ corresponds to $X$. In order to achieve
greater anonymity, we would like the perturbations to be large enough that
some of the spurious records in the data have a greater log-likelihood fit to
$Z$ than the true record $X$. It has been shown in [10] that this probability
decreases rapidly with increasing dimensionality for different kinds of
perturbing distributions. Thus, the randomization technique also seems to be
susceptible to the curse of high dimensionality.
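
A small experiment can make the re-identification argument concrete. The
sketch below is not the procedure of [10]; it assumes additive zero-mean
Gaussian noise as the perturbing distribution, public records drawn uniformly
from the unit cube, and hypothetical function names. It scores each public
record W by the plain sum of log f_Y(z_i - w_i), oriented so that a larger
score means a more plausible match, and reports how often the true record
obtains the best score.

```python
import math
import random


def log_density_gaussian(y, sigma):
    """Log of the N(0, sigma^2) density at y (assumed form of f_Y)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - y ** 2 / (2 * sigma ** 2)


def fit_score(z, w, sigma):
    """Sum of log f_Y(z_i - w_i); larger means W more plausibly produced Z."""
    return sum(log_density_gaussian(zi - wi, sigma) for zi, wi in zip(z, w))


def reidentification_rate(n_records, n_dims, sigma, trials=100, seed=0):
    """Fraction of trials in which the true record has the best score."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Assumption: public records are uniform in the unit cube.
        public = [[rng.random() for _ in range(n_dims)]
                  for _ in range(n_records)]
        x = public[0]  # the true record behind the perturbed one
        z = [xi + rng.gauss(0.0, sigma) for xi in x]  # perturbed record
        best = max(range(n_records),
                   key=lambda j: fit_score(z, public[j], sigma))
        if best == 0:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # With the noise level held fixed, the attacker's success rate rises
    # as the number of dimensions grows.
    for d in (1, 5, 20, 100):
        print(d, reidentification_rate(100, d, sigma=0.5))
```

With the per-attribute noise level held fixed, each added dimension gives the
attacker further evidence, so spurious records become less likely to outscore
the true one.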

We note that the problem of high dimensionality seems to be a fundamental one
for privacy preservation, and it is unlikely that more effective methods can be
found to preserve privacy when background information about a large number of
features is available to even a subset of selected individuals. Indirect
examples of such violations occur with the use of trail identifications
[78, 79], where information from multiple sources can be compiled to create a
high-dimensional feature representation which violates privacy.
