In section 4, we will discuss issues in distributed privacy-preserving data mining. In
section 5, we will discuss a number of techniques for privacy which arise in the
context of sensitive output of a variety of data mining and data management
applications. In section 6, we will discuss some unique challenges associated
with privacy in the high dimensional case. Section 7 contains the conclusions
and discussions.
2 The Randomization Method
In this section, we will discuss the randomization method for privacy-preserving
data mining. The randomization method has been traditionally used in the
context of distorting data by probability distribution for methods such as surveys
which have an evasive answer bias because of privacy concerns [69, 111].
This technique has also been extended to the problem of privacy-preserving
data mining [2].
The method of randomization can be described as follows. Consider a set
of data records denoted by X = {x_1 ... x_N}. For a record x_i ∈ X, we add
a noise component which is drawn from the probability distribution f_Y(y).
These noise components are drawn independently, and are denoted y_1 ... y_N.
Thus, the new set of distorted records is denoted by x_1 + y_1 ... x_N + y_N.
We denote this new set of records by z_1 ... z_N. In general, it is assumed that
the variance of the added noise is large enough that the original record
values cannot be easily guessed from the distorted data. Thus, the original
records cannot be recovered, but the distribution of the original records can
be recovered.
Thus, if X is the random variable denoting the data distribution for the
original records, Y is the random variable describing the noise distribution,
and Z is the random variable denoting the final records, we have:

Z = X + Y
X = Z - Y
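As a concrete illustration, the following Python sketch shows this distortion step. The Gaussian choice of f_Y, the noise standard deviation, and the function name are assumptions made here for illustration, not details fixed by the text.

    import numpy as np

    def randomize(records, noise_std=2.0, seed=0):
        """Distort each record x_i by adding independent noise y_i drawn from f_Y.

        f_Y is assumed here (for illustration) to be a Gaussian with standard
        deviation `noise_std`; any publicly known distribution with sufficiently
        large variance could be used instead.
        """
        rng = np.random.default_rng(seed)
        x = np.asarray(records, dtype=float)
        y = rng.normal(loc=0.0, scale=noise_std, size=x.shape)  # noise y_1 ... y_N
        z = x + y                                               # released records z_i = x_i + y_i
        return z

    # Only the distorted values z and the noise distribution f_Y are made public.
    z = randomize([23.0, 31.5, 42.0, 27.8, 35.1])

Note that each record is perturbed with its own independent noise value, so the data collector never sees any original x_i.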
Now, we note that N instantiations of the probability distribution Z are
known, whereas the distribution Y is known publicly. For a sufficiently large
number of records N, the distribution of Z can be approximated closely by
using a variety of methods such as kernel density estimation. By subtracting
Y from the approximated distribution of Z, it is possible to approximate the
original probability distribution X. In practice, one can combine the process
of approximating Z with the subtraction of the distribution Y from Z by using
a variety of iterative methods such as those discussed in [2, 5]. Such iterative
methods typically have a higher accuracy than the sequential solution of
first approximating Z and then subtracting Y from it. In particular, the EM
method proposed in [5] shows a number of optimal properties in approximating
the distribution of X.
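To make the reconstruction step concrete, the following sketch gives a discretized iterative Bayesian scheme in the spirit of the methods discussed in [2, 5]. The grid resolution, iteration count, uniform initial guess, and Gaussian noise model are assumptions of this sketch rather than details prescribed by the text.

    import numpy as np

    def reconstruct_distribution(z, noise_pdf, grid, n_iter=50):
        """Iteratively estimate the density of X from distorted values z = x + y.

        A discretized sketch of an iterative Bayesian reconstruction; `grid`,
        `n_iter`, and the `noise_pdf` callable are illustrative assumptions.
        """
        z = np.asarray(z, dtype=float)
        grid = np.asarray(grid, dtype=float)
        step = grid[1] - grid[0]
        fx = np.full(grid.shape, 1.0 / (grid[-1] - grid[0]))    # uniform initial guess for f_X
        # fy[i, a] = f_Y(z_i - grid[a]): likelihood of observing z_i if x_i = grid[a]
        fy = noise_pdf(z[:, None] - grid[None, :])
        for _ in range(n_iter):
            # Posterior of each x_i over the grid, given the current estimate of f_X.
            post = fy * fx[None, :]
            post /= post.sum(axis=1, keepdims=True) * step
            # The new estimate of f_X is the average posterior over all records.
            fx = post.mean(axis=0)
            fx /= fx.sum() * step                               # renormalize to a proper density
        return fx

    # Example usage with the distorted values z from the sketch above and Gaussian
    # noise of standard deviation 2.0 (both assumptions of this illustration):
    noise_std = 2.0
    gauss = lambda t: np.exp(-0.5 * (t / noise_std) ** 2) / (noise_std * np.sqrt(2.0 * np.pi))
    grid = np.linspace(0.0, 60.0, 121)
    # fx_hat = reconstruct_distribution(z, gauss, grid)

Each iteration sharpens the estimate of f_X by averaging, over all records, the posterior of x_i given z_i under the current estimate, which is why such iterative schemes typically outperform a one-shot approximation of Z followed by subtraction of Y.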