We can model $t_1$ using a probability density function (pdf) $G_{t_1} : DS_{A,D} \rightarrow [0, 1]$. Specifically:

$$G_{t_1}(x) = \begin{cases} 1 & \text{if } x = (t_1[A],\, t_1[D]) \\ 0 & \text{otherwise} \end{cases} \tag{12}$$

where $x$ is a 2D random variable in $DS_{A,D}$. Figure 2a demonstrates the pdf.
Assume that a researcher wants to re-construct an approximate pdf $\tilde{G}^{gen}_{t_1}$ of $t_1$ from the generalized Table 3b. From her/his perspective, $t_1[A]$ can be any value in the interval $[21, 60]$ with equal probability $1/40$, but $t_1[D]$ must be pneumonia. Hence,

$$\tilde{G}^{gen}_{t_1}(x) = \begin{cases} 1/40 & \text{if } x[A] \in [21, 60] \text{ and } x[D] = \text{pneumonia} \\ 0 & \text{otherwise} \end{cases} \tag{13}$$

which is illustrated in Figure 2b.
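The construction behind Equation 13 can be sketched in a few lines. This is only an illustration: it assumes that ages are integers, so that the interval [21, 60] contains exactly 40 values (consistent with the probability 1/40 above), and the function name `generalized_pdf` is hypothetical.

```python
from fractions import Fraction

def generalized_pdf(age_lo, age_hi, disease):
    """Uniform pdf over the integer ages in [age_lo, age_hi], paired with one disease.

    Assumption: the age domain is discrete (integers). Points that are not
    keys of the returned dict implicitly have probability 0.
    """
    ages = range(age_lo, age_hi + 1)
    p = Fraction(1, len(ages))
    return {(age, disease): p for age in ages}

# What the researcher can infer about t_1 from the generalized table.
gen_pdf = generalized_pdf(21, 60, "pneumonia")
```

Exact fractions are used instead of floats so that the probabilities sum to exactly 1.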
Instead, suppose that the researcher re-constructs a pdf $\tilde{G}^{ana}_{t_1}$ from the QIT and ST in Tables 4a and 4b. This time, s/he knows that $t_1[A]$ must be 23 (since age is published directly), but $t_1[D]$ can be pneumonia or dyspepsia with 50% probability (the ST shows that half of the tuples in QI-group 1 are associated with these two diseases, respectively). Therefore,

$$\tilde{G}^{ana}_{t_1}(x) = \begin{cases} 1/2 & \text{if } x = (23, \text{pneumonia}) \text{ or } x = (23, \text{dyspepsia}) \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

as shown in Figure 2c. Obviously, the pdf approximated from the anatomized tables is more accurate than that (Figure 2b) from the generalized table.
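A minimal sketch of how Equation 14 follows from the published tables is given below. The function name `anatomized_pdf` and the concrete counts (2 and 2) are assumptions made for illustration; the text only states that the two diseases each account for half of QI-group 1.

```python
from fractions import Fraction

def anatomized_pdf(age, st_counts):
    """pdf of a tuple given its exact QI value (from the QIT) and the
    per-disease counts of its QI-group (from the ST).

    Each disease gets probability count / group size; every other point
    has probability 0 and is simply left out of the dict.
    """
    group_size = sum(st_counts.values())
    return {(age, disease): Fraction(count, group_size)
            for disease, count in st_counts.items()}

# Hypothetical ST entries for QI-group 1: equal counts for both diseases.
ana_pdf = anatomized_pdf(23, {"pneumonia": 2, "dyspepsia": 2})
```

The age is known exactly, so the uncertainty is confined to the sensitive attribute.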
Towards a more rigorous comparison, given an approximate pdf $\tilde{G}_{t_1}$ (Equation 13 or 14), a natural way of quantifying its approximation quality is to calculate its "$L_2$ distance" from the actual pdf $G_{t_1}$ (Equation 12):

$$\sum_{x \in DS_{A,D}} \left( \tilde{G}_{t_1}(x) - G_{t_1}(x) \right)^2. \tag{15}$$
The distance of $\tilde{G}^{ana}_{t_1}$ is 0.5, indeed significantly lower than the distance 22.5 of $\tilde{G}^{gen}_{t_1}$. Although we focused on $t_1$, in the same way, it is easy to verify that the anatomized tables permit better re-construction of the pdfs of all tuples in Table 3a.
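The distance in Equation 15 can be evaluated directly, as in the sketch below. It assumes integer ages and takes $t_1$ = (23, pneumonia), as the preceding discussion implies; the helper name `l2_distance` is hypothetical. Under these assumptions the sketch gives 1/2 for the anatomized pdf, matching the 0.5 quoted above, and 39/40 for the generalized pdf. The latter differs from the quoted 22.5, which may rest on a treatment of the domain that is not spelled out here, but the ordering of the two distances is the same.

```python
from fractions import Fraction

def l2_distance(approx, actual):
    """Sum of squared differences between two pdfs stored as dicts.

    Points missing from a dict have probability 0, so summing over the
    union of the two supports equals summing over the whole domain.
    """
    support = set(approx) | set(actual)
    return sum((approx.get(x, 0) - actual.get(x, 0)) ** 2 for x in support)

# Actual pdf of t_1 (Equation 12), assuming t_1 = (23, pneumonia).
actual = {(23, "pneumonia"): Fraction(1)}

# Approximation from the generalized table (Equation 13), integer ages assumed.
gen = {(age, "pneumonia"): Fraction(1, 40) for age in range(21, 61)}

# Approximation from the anatomized tables (Equation 14).
ana = {(23, "pneumonia"): Fraction(1, 2), (23, "dyspepsia"): Fraction(1, 2)}

err_gen = l2_distance(gen, actual)
err_ana = l2_distance(ana, actual)
```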
5 Summary
In this chapter, we studied two anonymization frameworks for privacy-preserving
data publication: generalization and anatomy. Generally speaking,