Database Reference
In-Depth Information
100
100
80
80
60
60
40
40
Precision
Recall
Approximation quality
Count answers
Sum answers
20
20
0
0
3
4
5
6
7
10
20
30
40
50
Parameter
Quantile parameter
(a) Equi-width histogram.
(d) Equi-depth histogram.
Fig. 7.15 Approximation quality.
matching probabilities are greater than 0, we can obtain the histogram answers as
shown in Figure 7.10.
The above three queries illustrate the difference between the answers to aggregate
queries on probabilistic linkages and the answers computed only on the linked pairs
with high confidence, which clearly shows the effectiveness of aggregate queries on
probabilistic linkages.
7.7.2 Results on Synthetic Data Sets
We evaluate the efficiency and the effectiveness of the query answering techniques
in synthetic data sets with different parameter settings.
A data set contains N l linkages in tables A and B containing N t tuples each. In
the corresponding PME-graph, the degree of a tuple denotes the number of other
tuples (maximal cliques) it connects to. The degree of each tuple follows the Normal
distribution N
. A data set contains N c connected components. We generate
the linkages as follows. First, for each tuple t A
t , σ t )
A , a set of linkages are generated
associating with t A . Then, for each tuple t B
B , we randomly pair the tuples in A
is created, all tuple t A
to t B . In order to avoid loops, once a linkage
A
that are in the same connected component with t A cannot be assigned to t B . The
membership probability of each linkage is randomly assigned and normalized so
that the probability of each tuple is between
L (
t A ,
t B )
(
,
]
0
1
.
,
By default, a synthetic data set contains 20
000 linkages between tables A and
B with 5
,
000 tuples each. The degree of a tuple follows the Normal distribution
N (
1.
The parameter k for equi-depth histogram answer is set to 10. We set parameter
k
4
,
1
)
. The bucket width
η =
1000 and the minimum probability threshold
τ =
0
.
4.
First, we evaluate the efficiency and scalability of the query answering methods.
All algorithm developed are scalable when the number of linkages is as large as
100
=
1000 and p
=
0
.
000.
Figures 7.11(a) to (d) show the runtime of the query answering methods for
probabilistic threshold top- k queries in various parameter settings. Exact-topk is
,
 
Search WWH ::




Custom Search