Databases Reference
In-Depth Information
Such a tradeoff has been studied in [30] where authors develop a strategy to
minimize the disclosure with constraint on the performance degradation 6 .
Let us take the case where bucket based generalization is performed over
a single dimensional ordered data set, e.g., a numeric attribute and the query
class is that of 1-dimensional range queries. The authors in [30] propose en-
tropy and variance of the value distributions in the bucket as appropriate
measure of (the inverse) disclosure risk. Entropy captures the notion of uncer-
tainty associated with a random element chosen with a probability that follows
a certain distribution. The higher the value of entropy of a distribution (i.e.,
larger the number of distinct values and more uniform the frequencies, larger
is the value of the entropy), greater is the uncertainty regarding the true value
of the element. For example, given a domain having 5 distinct values and the
data set having 20 data points, the entropy is maximized if all 5 values appear
equal number of time, i.e. each value has a frequency of 4.
In a secure index-based scheme, the adversary only sees the bucket label B
of a data element t . Therefore, if the adversary (somehow) learns the distribu-
tion (frequencies) of values within B , he can guess the true value (say v )of
t with a probability equal to the fractional proportion of elements with value
v within the bucket. The notion of uncertainty regarding the true value can
be captured in an aggregate manner by the entropy of the value distribution
within B . Entropy of a discrete random variable X taking values x i =1 ,...,n
with corresponding probabilities p i ,i =1 ,...,n is given by:
n
Entropy ( X )= H ( X )=
p i log 2 ( p i )
i =1
If the domain of the attribute has an order defined on it as in the case
of a numeric attribute, the above definition of entropy does not capture the
notion of distance between two values. In the worst case model, since the value
distribution is assumed to be known to the adversary, greater the spread of
each bucket distribution, better is the protection against disclosure. Therefore,
the authors propose variance of the bucket distribution as the second (inverse)
measure of disclosure risk associated with each bucket. That is, higher the
variance of the value distribution, lower is the disclosure risk.
n
n
E ( X )) 2 , where E ( X )= 1
n
Variance ( X )=
p i ( x i
p i x i
i =1
i =1
After specifying these measures of disclosure-risk, [30] propose a 2-phase
algorithm for creating the secure indices. The goal is to provide the data owner
6 Notice the dual of the problem maximize performance with a constraint on
information disclosure would also be addressed once we agree on the metric for
information disclosure. However, such an articulation of the problem has not been
studied in the literature.
 
Search WWH ::




Custom Search