Algebraic Optimization of RDF Graph Pattern Queries on MapReduce - Large Scale and Big Data: Processing and Management - page 213


Algorithm 6.4: The Extended POTGGroupPackage

Reduce(key: Sub, val: List of tuples T);

foreach tup (s, p, o) ∈ T do
    set p in locBitSet;
    add (p, o) to tempMap;
matchedList = match(locBitSet, ECList);
if (|matchedList| > 1) then
    // Ambiguous TripleGroup
    foreach EC ∈ matchedList do
        propMap ← cloneMap(tempMap, EC.propList);
        emit RDFMap(Sub, EC, propMap);
else
    // Perfect TripleGroup
    emit RDFMap(Sub, matchedList[0], tempMap);
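The reduce step of Algorithm 6.4 can be sketched in plain Python to make the two branches concrete. This is only an illustrative sketch, not the authors' implementation: the listing does not define match or cloneMap, so the versions below are assumptions (an equivalence class is modeled as its property set and matches when all of its properties occur in the group; clone_map copies only the entries for those properties). The bit set is modeled as a Python set, the emitted RDFMap as a tuple, and the empty-match case, which the listing leaves open, returns nothing here.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Triple = Tuple[str, str, str]      # (subject, property, object)
PropMap = Dict[str, List[str]]     # property -> list of objects
EquivClass = FrozenSet[str]        # assumption: an EC is just its property set


def match(props: set, ec_list: List[EquivClass]) -> List[EquivClass]:
    # Assumed rule: an EC matches if every property it requires is present.
    return [ec for ec in ec_list if ec <= props]


def clone_map(temp_map: PropMap, prop_list: EquivClass) -> PropMap:
    # Assumed behaviour: copy only the entries for the EC's properties.
    return {p: list(objs) for p, objs in temp_map.items() if p in prop_list}


def reduce_group(sub: str, triples: Iterable[Triple],
                 ec_list: List[EquivClass]) -> List[Tuple[str, EquivClass, PropMap]]:
    loc_props = set()              # stands in for locBitSet
    temp_map = defaultdict(list)   # stands in for tempMap
    for _s, p, o in triples:
        loc_props.add(p)
        temp_map[p].append(o)

    matched = match(loc_props, ec_list)
    if len(matched) > 1:
        # Ambiguous TripleGroup: emit one projected copy per matching EC.
        return [(sub, ec, clone_map(temp_map, ec)) for ec in matched]
    if len(matched) == 1:
        # Perfect TripleGroup: emit the whole group once.
        return [(sub, matched[0], dict(temp_map))]
    return []                      # assumption: no match, nothing emitted


# Hypothetical equivalence classes and triples for illustration only.
EC_A = frozenset({"type", "publisher"})
EC_B = frozenset({"type", "label"})
ambiguous = reduce_group(
    "s1",
    [("s1", "type", "Product"), ("s1", "publisher", "pub1"), ("s1", "label", "L")],
    [EC_A, EC_B],
)
perfect = reduce_group(
    "s2",
    [("s2", "type", "Product"), ("s2", "publisher", "pub2")],
    [EC_A, EC_B],
)
```

In this sketch the subject s1 carries the properties of both classes, so it is emitted twice, each time restricted to one class's properties, while s2 matches a single class and is emitted once with its full property map.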

6.8.2.1 Setup and Testbed

The evaluation was conducted on a 10-node Hadoop cluster with the BSBM-250k data set (approximately 86M triples with 250k Products, 22 GB). Five queries (dq0 to dq4), each containing two star patterns, are considered, with the number of repeated properties in the second star subpattern varying from 0 to 4, respectively. Figure 6.14 shows the graph representation of queries dq0 and dq4 (black and gray edges denote an arbitrary unique property and a repeated property, respectively). The queries include the following DupPs: dq0 (none), dq1 (publisher), dq2 (publisher, type), dq3 (publisher, type, label), and dq4 (publisher, type, label, date). To evaluate scalability with increasing data size, four BSBM data sets were used: BSBM-{250k, 500k, 750k, 1000k}, ranging in size from 22 GB (BSBM-250k) to 86 GB (BSBM-1000k).
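The query configuration described above can be written down as a small lookup table, which makes the progression of repeated properties easy to check. This is a sketch for illustration only: the property names are those listed in the text, while the names DUP_PROPS and repeated_count are hypothetical.

```python
# Repeated properties (DupPs) per query, as listed in the text.
DUP_PROPS = {
    "dq0": [],
    "dq1": ["publisher"],
    "dq2": ["publisher", "type"],
    "dq3": ["publisher", "type", "label"],
    "dq4": ["publisher", "type", "label", "date"],
}


def repeated_count(query: str) -> int:
    # Number of properties of the second star subpattern that are repeated.
    return len(DUP_PROPS[query])


# Each query repeats exactly one more property than the previous one.
counts = [repeated_count(f"dq{i}") for i in range(5)]
```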

6.8.2.2 Varying Number of Repeated Properties across a Query

Figure 6.15a shows the execution time and the number of bytes read from HDFS using the three approaches. In general, SHARD results in the highest execution time and

FIGURE 6.14

Graph representation of the example queries dq0 and dq4. Edge labels shown in the figure: :type, :publisher, :name, :date.
