tuple ID    Age   Sex   Zipcode   Disease
1 (Bob)     23    M     11000     pneumonia
2           27    M     13000     dyspepsia
3           35    M     59000     dyspepsia
4           59    M     12000     pneumonia
5           61    F     54000     flu
6           65    F     25000     gastritis
7 (Alice)   65    F     25000     flu
8           70    F     30000     bronchitis
(a) The microdata
tuple ID    Age        Sex   Zipcode          Disease
1           [21, 60]   M     [10001, 60000]   pneumonia
2           [21, 60]   M     [10001, 60000]   dyspepsia
3           [21, 60]   M     [10001, 60000]   dyspepsia
4           [21, 60]   M     [10001, 60000]   pneumonia
5           [61, 70]   F     [10001, 60000]   flu
6           [61, 70]   F     [10001, 60000]   gastritis
7           [61, 70]   F     [10001, 60000]   flu
8           [61, 70]   F     [10001, 60000]   bronchitis
(b) A 2-diverse table
Table 3. Another generalization example
4.1 Motivation
Although generalization preserves privacy, it often loses considerable
information in the microdata, which severely compromises the accuracy of
data analysis. We illustrate this by using the microdata in Table 3a and
the 2-diverse generalization in Table 3b. Assume that a researcher wants
to derive from this table an estimate for the following query:
A: SELECT COUNT(*) FROM Unknown-Microdata
   WHERE Disease = 'pneumonia' AND Age <= 30
   AND Zipcode IN [10001, 20000]
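For reference, query A can be evaluated directly on the microdata of Table 3a, which the researcher does not have but which gives the exact answer an estimate should approach. The following is a minimal Python sketch. The tuple layout and the names MICRODATA and count_query_a are illustrative assumptions, not part of the original text.

```python
# Microdata of Table 3a as (tuple_id, age, sex, zipcode, disease).
# The tuple layout and all names here are illustrative assumptions.
MICRODATA = [
    (1, 23, "M", 11000, "pneumonia"),
    (2, 27, "M", 13000, "dyspepsia"),
    (3, 35, "M", 59000, "dyspepsia"),
    (4, 59, "M", 12000, "pneumonia"),
    (5, 61, "F", 54000, "flu"),
    (6, 65, "F", 25000, "gastritis"),
    (7, 65, "F", 25000, "flu"),
    (8, 70, "F", 30000, "bronchitis"),
]


def count_query_a(rows):
    """Count tuples satisfying the three predicates of query A."""
    return sum(
        1
        for _tid, age, _sex, zipcode, disease in rows
        if disease == "pneumonia"          # Disease = 'pneumonia'
        and age <= 30                      # Age <= 30
        and 10001 <= zipcode <= 20000      # Zipcode IN [10001, 20000]
    )


print(count_query_a(MICRODATA))  # prints 1 (only tuple 1 qualifies)
```

Only tuple 1 satisfies all three predicates; tuple 4 also has pneumonia and a qualifying zipcode, but its age (59) falls outside the range.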
To illustrate how to process the query, Figure 1 shows a 2D space, where
the x- and y-dimensions are Age and Zipcode, respectively. Each point
denotes a tuple in the microdata of Table 3a. For example, the x- and
y-coordinates of point 1 equal the age and zipcode of tuple 1,
respectively. Rectangle R1 (or R2) is obtained from the generalized values
in the first (or second) QI-group in Table 3b. For instance, the x- (y-)
projection of R1 is the generalized age [21, 60] (zipcode [10001, 60000])
of tuples 1-4. Query A is represented as the shaded rectangle Q, whose
projection on the x- (y-) dimension is decided by the range condition
Age ≤ 30 (10001 ≤ Zipcode ≤ 20000).
Since the researcher sees only R1 and R2 (but not the points), s/he
answers query A in a way similar to selectivity estimation on a
multidimensional