In section 4, we will discuss issues in distributed privacy-preserving data mining. In
section 5, we will discuss a number of techniques for privacy which arise in the
context of sensitive output of a variety of data mining and data management
applications. In section 6, we will discuss some unique challenges associated
with privacy in the high dimensional case. Section 7 contains the conclusions
and discussions.
2 The Randomization Method
In this section, we will discuss the randomization method for privacy-preserving
data mining. The randomization method has been traditionally used in the
context of distorting data by probability distribution for methods such as surveys
which have an evasive answer bias because of privacy concerns [69, 111].
This technique has also been extended to the problem of privacy-preserving
data mining [2].
The method of randomization can be described as follows. Consider a set
of data records denoted by X = {x_1 ... x_N}. For a record x_i ∈ X, we add
a noise component which is drawn from the probability distribution f_Y(y).
These noise components are drawn independently, and are denoted y_1 ... y_N.
Thus, the new set of distorted records is denoted by x_1 + y_1 ... x_N + y_N.
We denote this new set of records by z_1 ... z_N. In general, it is assumed that
the variance of the added noise is large enough that the original record
values cannot be easily guessed from the distorted data. Thus, the original
records cannot be recovered, but the distribution of the original records can
be recovered.
Thus, if X is the random variable denoting the data distribution for the
original records, Y is the random variable describing the noise distribution,
and Z is the random variable denoting the final records, we have:

Z = X + Y
X = Z - Y
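As a concrete illustration, the following Python sketch shows this distortion step. The Gaussian choice of f_Y, the noise standard deviation, and the function name are assumptions made here for illustration, not details fixed by the text.

    import numpy as np

    def randomize(records, noise_std=2.0, seed=0):
        """Distort each record x_i by adding independent noise y_i drawn from f_Y.

        f_Y is assumed here (for illustration) to be a Gaussian with standard
        deviation `noise_std`; any publicly known distribution with sufficiently
        large variance could be used instead.
        """
        rng = np.random.default_rng(seed)
        x = np.asarray(records, dtype=float)
        y = rng.normal(loc=0.0, scale=noise_std, size=x.shape)  # noise y_1 ... y_N
        z = x + y                                               # released records z_i = x_i + y_i
        return z

    # Only the distorted values z and the noise distribution f_Y are made public.
    z = randomize([23.0, 31.5, 42.0, 27.8, 35.1])

Note that each record is perturbed with its own independent noise value, so the data collector never sees any original x_i.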
Now, we note that N instantiations of the probability distribution Z are
known, whereas the distribution Y is known publicly. For a sufficiently large
number of records N, the distribution of Z can be approximated closely by
using a variety of methods such as kernel density estimation. By subtracting
Y from the approximated distribution of Z, it is possible to approximate the
original probability distribution X. In practice, one can combine the process
of approximating Z with the subtraction of the distribution Y from Z by using
a variety of iterative methods such as those discussed in [2, 5]. Such iterative
methods typically have a higher accuracy than the sequential solution of
first approximating Z and then subtracting Y from it. In particular, the EM
method proposed in [5] shows a number of optimal properties in approximating
the distribution of X.
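To make the reconstruction step concrete, the following sketch gives a discretized iterative Bayesian scheme in the spirit of the methods discussed in [2, 5]. The grid resolution, iteration count, uniform initial guess, and Gaussian noise model are assumptions of this sketch rather than details prescribed by the text.

    import numpy as np

    def reconstruct_distribution(z, noise_pdf, grid, n_iter=50):
        """Iteratively estimate the density of X from distorted values z = x + y.

        A discretized sketch of an iterative Bayesian reconstruction; `grid`,
        `n_iter`, and the `noise_pdf` callable are illustrative assumptions.
        """
        z = np.asarray(z, dtype=float)
        grid = np.asarray(grid, dtype=float)
        step = grid[1] - grid[0]
        fx = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))    # uniform initial guess for f_X
        # fy[i, a] = f_Y(z_i - grid[a]): likelihood of observing z_i if x_i = grid[a]
        fy = noise_pdf(z[:, None] - grid[None, :])
        for _ in range(n_iter):
            # Posterior of each x_i over the grid, given the current estimate of f_X.
            post = fy * fx[None, :]
            post /= post.sum(axis=1, keepdims=True) * step
            # The new estimate of f_X is the average posterior over all records.
            fx = post.mean(axis=0)
            fx /= fx.sum() * step                               # renormalize to a proper density
        return fx

    # Example usage with the distorted values z from the sketch above and Gaussian
    # noise of standard deviation 2.0 (both assumptions of this illustration):
    noise_std = 2.0
    gauss = lambda t: np.exp(-0.5 * (t / noise_std) ** 2) / (noise_std * np.sqrt(2.0 * np.pi))
    grid = np.linspace(0.0, 60.0, 121)
    # fx_hat = reconstruct_distribution(z, gauss, grid)

Each iteration sharpens the estimate of f_X by averaging, over all records, the posterior of x_i given z_i under the current estimate, which is why such iterative schemes typically outperform a one-shot approximation of Z followed by subtraction of Y.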