Large-Scale RDF Processing with MapReduce - Large Scale and Big Data: Processing and Management

The performance of MAPSIN joins correlates strongly with the number of index lookups in HBase, so minimizing the number of lookups is crucial for optimization. In many situations, the number of requests can be reduced by leveraging the RDF schema design for HBase outlined in Section 5.5. If the join variable is always in subject position or always in object position for all triple patterns, then all mappings for p_2, p_3, …, p_{n+1} that are compatible with the input mapping μ_1 for p_1 are stored in the same HBase table row of T_s_po or T_o_ps, respectively. A single table lookup can therefore replace n subsequent lookups: all compatible mappings are retrieved at once, saving n − 1 lookups for each invocation of the map function. Algorithm 5.3 outlines this optimized case.

Algorithm 5.3: MAPSIN Optimized Multiway Join: map(inKey, inValue)

input : inKey, inValue: input mapping
output : multiset of solution mappings

#p ← Config.getNumberOfMultiwayPatterns()
μ_n ← inValue.getInputMapping()
Ω_{n+#p} ← ∅
// iterate over all subsequent multiway patterns
for i ← 1 to #p do
    p_{n+i} ← Config.getNextPattern()
    p^sub_{n+i} ← μ_n(p_{n+i})  // substitute shared vars in p_{n+i}
end
results ← HBase.GET(p^sub_{n+1}, ..., p^sub_{n+#p})  // table index lookup with substituted pattern
if results ≠ ∅ then
    // merge μ_n with compatible mappings for p_{n+1}, ..., p_{n+#p}
    foreach mapping μ' in results do
        μ_{n+#p} ← μ_n ∪ μ'
        Ω_{n+#p} ← Ω_{n+#p} ∪ μ_{n+#p}
    end
    emit(null, Ω_{n+#p})
end
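The optimized multiway join can be made concrete with a minimal Python sketch. It is not the authors' implementation: an in-memory dictionary stands in for the HBase table T_s_po, and the names FakeTable, substitute, bind, match_row, and map_multiway are hypothetical. The sketch assumes that the join variable is in subject position and is already bound by the input mapping μ_n, so that one row lookup yields every compatible mapping.

```python
from typing import Dict, List, Tuple

Mapping = Dict[str, str]
Pattern = Tuple[str, str, str]   # (subject, predicate, object); variables start with "?"
Row = List[Tuple[str, str]]      # (predicate, object) columns of one subject row


class FakeTable:
    """Hypothetical in-memory stand-in for T_s_po that counts row lookups."""

    def __init__(self, rows: Dict[str, Row]) -> None:
        self.rows = rows
        self.lookups = 0

    def get(self, row_key: str) -> Row:
        self.lookups += 1
        return self.rows.get(row_key, [])


def is_var(term: str) -> bool:
    return term.startswith("?")


def substitute(mu: Mapping, pattern: Pattern) -> Pattern:
    """Replace variables that are bound in mu by their values."""
    s, p, o = pattern
    return (mu.get(s, s), mu.get(p, p), mu.get(o, o))


def bind(mu: Mapping, term: str, value: str) -> bool:
    """Bind a variable to value (or compare a constant); False on conflict."""
    if is_var(term):
        return mu.setdefault(term, value) == value
    return term == value


def match_row(row: Row, patterns: List[Pattern]) -> List[Mapping]:
    """Mappings that satisfy every pattern using the columns of a single row."""
    partials: List[Mapping] = [{}]
    for _, pred_term, obj_term in patterns:
        extended = []
        for partial in partials:
            for pred, obj in row:
                candidate = dict(partial)
                if bind(candidate, pred_term, pred) and bind(candidate, obj_term, obj):
                    extended.append(candidate)
        partials = extended
    return partials


def map_multiway(mu_n: Mapping, patterns: List[Pattern], table: FakeTable) -> List[Mapping]:
    """Merge mu_n with all compatible mappings found in one table row."""
    substituted = [substitute(mu_n, p) for p in patterns]
    subjects = {s for s, _, _ in substituted}
    assert len(subjects) == 1 and not is_var(next(iter(subjects))), \
        "sketch assumes one bound subject shared by all patterns"
    row = table.get(subjects.pop())  # one lookup instead of len(patterns)
    return [{**mu_n, **mu_prime} for mu_prime in match_row(row, substituted)]


if __name__ == "__main__":
    table = FakeTable({"alice": [("knows", "bob"), ("age", "30"), ("city", "Berlin")]})
    patterns = [("?x", "age", "?a"), ("?x", "city", "?c")]
    print(map_multiway({"?x": "alice"}, patterns, table), table.lookups)
```

The lookup counter makes the saving visible: each call to map_multiway touches the table once, regardless of how many patterns share the subject.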

5.6.4 One-Pattern Queries

Queries with a single triple pattern that return only a small number of solution mappings can be executed locally on one machine. In general, however, the number of solution mappings for large data sets can exceed the capabilities of a single machine. For scalability it is therefore advantageous to use distributed execution with MapReduce even when no join is required. The map phase is initialized with a distributed table scan for the single triple pattern p_1, so each map function receives as input only those mappings that are locally available. The map function itself only has to emit the mappings to HDFS, without any further requests to HBase. If the query result is small, nondistributed query execution can reduce query execution time significantly, because MapReduce initialization takes up to 30 seconds in our cluster and clearly dominates the execution time.
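The map-only execution of a one-pattern query can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not the MapReduce or HBase API: each dictionary in regions is a hypothetical stand-in for the rows of T_s_po stored on one node, and scan_region, map_one_pattern, and run_one_pattern_query are illustrative names. The scan produces the local mappings, and the map function simply passes them through.

```python
from typing import Dict, Iterator, List, Tuple

Mapping = Dict[str, str]
Pattern = Tuple[str, str, str]            # variables start with "?"
Region = Dict[str, List[Tuple[str, str]]]  # subject -> (predicate, object) columns


def bind(mu: Mapping, term: str, value: str) -> bool:
    """Bind a variable to value (or compare a constant); False on conflict."""
    if term.startswith("?"):
        return mu.setdefault(term, value) == value
    return term == value


def scan_region(region: Region, pattern: Pattern) -> Iterator[Mapping]:
    """Stand-in for the distributed table scan: match p_1 against local rows only."""
    s_term, p_term, o_term = pattern
    for subject, columns in region.items():
        for pred, obj in columns:
            mu: Mapping = {}
            if bind(mu, s_term, subject) and bind(mu, p_term, pred) and bind(mu, o_term, obj):
                yield mu


def map_one_pattern(mu: Mapping) -> Mapping:
    """The map function emits its input mapping unchanged; no further lookups."""
    return mu


def run_one_pattern_query(regions: List[Region], pattern: Pattern) -> List[Mapping]:
    """Run one simulated map task per region and collect the emitted mappings."""
    output: List[Mapping] = []
    for region in regions:
        for mu in scan_region(region, pattern):
            output.append(map_one_pattern(mu))
    return output


if __name__ == "__main__":
    regions = [
        {"alice": [("knows", "bob"), ("age", "30")]},
        {"bob": [("knows", "carol")]},
    ]
    print(run_one_pattern_query(regions, ("?s", "knows", "?o")))
```

In this simulation the tasks run one after another; in a real cluster they would run in parallel on the nodes that hold the data, which is where the scalability benefit comes from.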
