Algorithm 6.4: The Extended POTGGroupPackage

Reduce(key: Sub, val: List of tuples T)
    foreach tup(s, p, o) ∈ T do
        set p in locBitSet;
        add (p, o) to tempMap;
    matchedList = match(locBitSet, ECList);
    if (|matchedList| > 1) then    // Ambiguous TripleGroup
        foreach EC ∈ matchedList do
            propMap ← cloneMap(tempMap, EC.propList);
            emit RDFMap(Sub, EC, propMap);
    else    // Perfect TripleGroup
        emit RDFMap(Sub, matchedList[0], tempMap);
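The reduce step of Algorithm 6.4 can be sketched in Python as follows. This is only an illustrative, single-process sketch, not the Hadoop implementation: the names `EquivalenceClass`, `match`, `clone_map`, and `reduce_group` are hypothetical, and the matching rule (an equivalence class matches when all of its properties occur in the subject's property set) is an assumption, since the algorithm does not spell out how `match` works. A Python set stands in for the bit set, and the emitted RDFMap is modeled as a plain tuple. The guard for an empty match list is also an addition; the pseudocode indexes `matchedList[0]` directly.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple

Triple = Tuple[str, str, str]                 # (subject, property, object)
RDFMap = Tuple[str, str, Dict[str, List[str]]]  # (subject, EC name, property map)


@dataclass(frozen=True)
class EquivalenceClass:
    """Hypothetical stand-in for an entry of ECList."""
    name: str
    prop_list: FrozenSet[str]


def match(loc_bit_set: set, ec_list: Iterable[EquivalenceClass]) -> List[EquivalenceClass]:
    # Assumption: an EC matches when every one of its properties is present.
    return [ec for ec in ec_list if ec.prop_list <= loc_bit_set]


def clone_map(temp_map: Dict[str, List[str]], prop_list: FrozenSet[str]) -> Dict[str, List[str]]:
    # Copy only the entries whose property belongs to the given EC.
    return {p: list(objs) for p, objs in temp_map.items() if p in prop_list}


def reduce_group(sub: str, tuples: Iterable[Triple],
                 ec_list: Iterable[EquivalenceClass]) -> List[RDFMap]:
    loc_bit_set = set()            # properties seen for this subject
    temp_map = defaultdict(list)   # property -> list of objects
    for _s, p, o in tuples:
        loc_bit_set.add(p)
        temp_map[p].append(o)

    matched = match(loc_bit_set, ec_list)
    emitted: List[RDFMap] = []
    if len(matched) > 1:
        # Ambiguous TripleGroup: emit one projected copy per matching EC.
        for ec in matched:
            emitted.append((sub, ec.name, clone_map(temp_map, ec.prop_list)))
    elif matched:
        # Perfect TripleGroup: emit the full map for the single match.
        emitted.append((sub, matched[0].name, dict(temp_map)))
    return emitted
```

For example, a subject with `type`, `publisher`, and `label` triples that is checked against two classes, one over `{type, publisher}` and one over `{type, label}`, would yield two maps under this sketch, each restricted to the properties of its class.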
6.8.2.1 Setup and Testbed
The evaluation was conducted on a 10-node Hadoop cluster with the BSBM-250k
data set (approximately 86M triples with 250k Products; 22 GB). Five queries
(dq0 to dq4), each containing two star patterns, are considered, with the number
of repeated properties in the second star subpattern varying from 0 to 4, respectively.
Figure 6.14 shows the graph representation of queries dq0 and dq4 (black and gray
edges denote an arbitrary unique property and a repeated property, respectively).
The queries include the following DupPs: dq0 (none), dq1 (publisher),
dq2 (publisher, type), dq3 (publisher, type, label), and
dq4 (publisher, type, label, date). To evaluate scalability with increasing data
size, four BSBM data sets were used: BSBM-{250k, 500k, 750k, 1000k}, ranging in
size from 22 GB (BSBM-250k) to 86 GB (BSBM-1000k).
6.8.2.2 Varying Number of Repeated Properties across a Query
Figure 6.15a shows the execution time and the number of bytes read from HDFS
for the three approaches. In general, SHARD results in the highest execution time and
FIGURE 6.14
Graph representation of the example queries dq0 and dq4 (edge labels: :type, :publisher, :name, :date).