tuple ID    Age   Sex   Zipcode   Disease
1 (Bob)     23    M     11000     pneumonia
2           27    M     13000     dyspepsia
3           35    M     59000     dyspepsia
4           59    M     12000     pneumonia
5           61    F     54000     flu
6           65    F     25000     gastritis
7 (Alice)   65    F     25000     flu
8           70    F     30000     bronchitis
(a) The microdata
tuple ID    Age        Sex   Zipcode          Disease
1           [21, 60]   M     [10001, 60000]   pneumonia
2           [21, 60]   M     [10001, 60000]   dyspepsia
3           [21, 60]   M     [10001, 60000]   dyspepsia
4           [21, 60]   M     [10001, 60000]   pneumonia
5           [61, 70]   F     [10001, 60000]   flu
6           [61, 70]   F     [10001, 60000]   gastritis
7           [61, 70]   F     [10001, 60000]   flu
8           [61, 70]   F     [10001, 60000]   bronchitis
(b) A 2-diverse table
Table 3. Another generalization example
4.1 Motivation
Although generalization preserves privacy, it often loses considerable
information in the microdata, which severely compromises the accuracy of
data analysis. We illustrate this using the microdata in Table 3a and its
2-diverse generalization in Table 3b. Assume that a researcher wants to
derive from Table 3b an estimate for the following query:
A: SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = 'pneumonia' AND Age <= 30
AND Zipcode BETWEEN 10001 AND 20000
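As a point of reference, the exact answer to query A can be computed directly from the microdata of Table 3a. The following is a minimal sketch, not part of the original text; the names MICRODATA and query_a are hypothetical and chosen only for illustration.

```python
# Microdata of Table 3a as (tuple_id, age, sex, zipcode, disease).
# MICRODATA and query_a are illustrative names, not from the source.
MICRODATA = [
    (1, 23, "M", 11000, "pneumonia"),
    (2, 27, "M", 13000, "dyspepsia"),
    (3, 35, "M", 59000, "dyspepsia"),
    (4, 59, "M", 12000, "pneumonia"),
    (5, 61, "F", 54000, "flu"),
    (6, 65, "F", 25000, "gastritis"),
    (7, 65, "F", 25000, "flu"),
    (8, 70, "F", 30000, "bronchitis"),
]


def query_a(rows):
    """Count tuples that satisfy all three predicates of query A."""
    return sum(
        1
        for _tid, age, _sex, zipcode, disease in rows
        if disease == "pneumonia" and age <= 30 and 10001 <= zipcode <= 20000
    )


print(query_a(MICRODATA))  # prints 1 (only tuple 1 qualifies)
```

Only tuple 1 satisfies all three conditions; tuple 4 also has pneumonia and a qualifying zipcode, but its age of 59 exceeds 30.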
To illustrate how to process the query, Figure 1 shows a 2D space, where
the x- and y-dimensions are Age and Zipcode, respectively. Each point
denotes a tuple in the microdata of Table 3a. For example, the x- and
y-coordinates of point 1 equal the age and zipcode of tuple 1,
respectively. Rectangle R1 (or R2) is obtained from the generalized values
in the first (or second) QI-group in Table 3b. For instance, the x- (y-)
projection of R1 is the generalized age [21, 60] (zipcode [10001, 60000])
of tuples 1-4. Query A is represented as the shaded rectangle Q, whose
projection on the x- (y-) dimension is decided by the range condition
Age ≤ 30 (10001 ≤ Zipcode ≤ 20000).
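The geometry described above can be sketched in a few lines. This is an illustrative sketch only, under two assumptions that are not in the original text: the generalized intervals are treated as continuous ranges, and Q is given no lower bound on Age because the query states only Age ≤ 30. The class name Rect and its methods are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in the (Age, Zipcode) plane (hypothetical helper)."""
    age_lo: float
    age_hi: float
    zip_lo: float
    zip_hi: float

    def area(self) -> float:
        # Assumption: intervals are continuous, so area is width times height.
        return (self.age_hi - self.age_lo) * (self.zip_hi - self.zip_lo)

    def intersect(self, other: "Rect") -> Optional["Rect"]:
        """Return the overlapping rectangle, or None if the two are disjoint."""
        age_lo = max(self.age_lo, other.age_lo)
        age_hi = min(self.age_hi, other.age_hi)
        zip_lo = max(self.zip_lo, other.zip_lo)
        zip_hi = min(self.zip_hi, other.zip_hi)
        if age_lo > age_hi or zip_lo > zip_hi:
            return None
        return Rect(age_lo, age_hi, zip_lo, zip_hi)


# Rectangles of the two QI-groups in Table 3b.
R1 = Rect(21, 60, 10001, 60000)
R2 = Rect(61, 70, 10001, 60000)
# Query A: Age <= 30 (no lower bound stated), 10001 <= Zipcode <= 20000.
Q = Rect(float("-inf"), 30, 10001, 20000)

overlap_r1 = Q.intersect(R1)  # ages 21..30, zipcodes 10001..20000
overlap_r2 = Q.intersect(R2)  # None: R2 starts at age 61, beyond Q
fraction_r1 = overlap_r1.area() / R1.area()  # share of R1 covered by Q

print(overlap_r1, overlap_r2, round(fraction_r1, 3))
```

Q overlaps only R1, and covers a small part of it, while the point for tuple 1 is the only microdata point that actually falls inside Q.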
Since the researcher sees only R1 and R2 (but not the points), s/he
answers query A in a way similar to selectivity estimation on a multidimensional