The performance of MAPSIN joins strongly correlates with the number of index
lookups in HBase, so minimizing the number of lookups is a crucial optimization.
In many situations, the number of requests can be reduced by leveraging the RDF
schema design for HBase outlined in Section 5.5. If the join variable is always in
subject position, or always in object position, in all triple patterns, then all
mappings for p_2, p_3, …, p_{n+1} that are compatible with the input mapping μ_1
for p_1 are stored in the same HBase table row of T_s_po or T_o_ps, respectively.
A single table lookup can therefore replace n subsequent lookups: all compatible
mappings are retrieved at once, saving n - 1 lookups for each invocation of the
map function. Algorithm 5.3 outlines this optimized case.
Algorithm 5.3: MAPSIN Optimized Multiway Join: map(inKey, inValue)
input : inKey, inValue: input mapping
output : multiset of solution mappings
#p ← Config.getNumberOfMultiwayPatterns()
μ_n ← inValue.getInputMapping()
Ω_{n+#p} ← ∅
// iterate over all subsequent multiway patterns
for i ← 1 to #p do
    p_{n+i} ← Config.getNextPattern()
    p^sub_{n+i} ← μ_n(p_{n+i})  // substitute shared vars in p_{n+i}
end
results ← HBase.GET(p^sub_{n+i}, ..., p^sub_{n+#p})  // table index lookup with substituted pattern
if results ≠ ∅ then
    // merge μ_n with compatible mappings for p_{n+i}, ..., p_{n+#p}
    foreach mapping μ' in results do
        μ_{n+#p} ← μ_n ∪ μ'
        Ω_{n+#p} ← Ω_{n+#p} ∪ μ_{n+#p}
    end
    emit(null, Ω_{n+#p})
end
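The idea behind Algorithm 5.3 can be illustrated with a minimal Python sketch. It is not the MAPSIN implementation: the class SubjectIndex is a hypothetical in-memory stand-in for the T_s_po table (row key = subject), its get method only simulates an HBase GET, and the helper names match_in_row and map_multiway are assumptions made for this example. The sketch also assumes that every pattern has the same subject variable and that this variable is already bound in the input mapping. The row is fetched once and all patterns are matched against it locally, so the lookup counter stays at one per map invocation.

```python
from collections import defaultdict


class SubjectIndex:
    """Hypothetical stand-in for T_s_po: one row per subject, holding (predicate, object) pairs."""

    def __init__(self, triples):
        self.rows = defaultdict(set)
        for s, p, o in triples:
            self.rows[s].add((p, o))
        self.lookups = 0  # number of simulated GET requests

    def get(self, subject):
        """Simulate a single GET that returns the whole row of a subject."""
        self.lookups += 1
        return self.rows.get(subject, set())


def match_in_row(row, pattern, mapping):
    """Return all extensions of `mapping` that match predicate and object of `pattern` in `row`."""
    _, p, o = pattern
    extensions = []
    for row_p, row_o in sorted(row):
        candidate = dict(mapping)
        compatible = True
        for term, value in ((p, row_p), (o, row_o)):
            if term.startswith("?"):
                # bind the variable, or check an existing binding for compatibility
                if candidate.setdefault(term, value) != value:
                    compatible = False
                    break
            elif term != value:
                compatible = False
                break
        if compatible:
            extensions.append(candidate)
    return extensions


def map_multiway(mu_n, patterns, index):
    """Join mu_n with all patterns that share the subject join variable, using one row lookup."""
    join_var = patterns[0][0]
    row = index.get(mu_n[join_var])  # the only table lookup in this invocation
    results = [dict(mu_n)]
    for pattern in patterns:
        results = [ext for m in results for ext in match_in_row(row, pattern, m)]
        if not results:
            return []  # no compatible mapping for this pattern, nothing to emit
    return results


index = SubjectIndex([
    ("alice", "knows", "bob"),
    ("alice", "name", "Alice"),
    ("bob", "name", "Bob"),
])
patterns = [("?x", "name", "?n"), ("?x", "knows", "?y")]
solutions = map_multiway({"?x": "alice"}, patterns, index)
```

In this toy setup, solutions holds the merged mappings for both patterns while index.lookups counts a single request; issuing one lookup per pattern would have needed two.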
5.6.4 One-Pattern Queries
Queries with a single triple pattern that return only a small number of solution
mappings can be executed locally on one machine. In general, however, the number
of solution mappings for large data sets can exceed the capabilities of a single
machine. With respect to scalability, it is therefore advantageous to use
distributed execution with MapReduce even when no join has to be performed. The
map phase is initialized with a distributed table scan for the single triple
pattern p_1, so each map function receives only those mappings as input that are
locally available. The map function itself only has to emit the mappings to HDFS
without any further requests to HBase. If the query result is small, however,
nondistributed query execution can reduce query execution time significantly,
because MapReduce initialization takes up to 30 seconds in our cluster and clearly
dominates the execution time.
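The map-only execution of a one-pattern query can be sketched as follows. This is a simplified simulation under stated assumptions, not the actual job setup: each list in splits stands for the triples that are locally available to one map task, and the names scan_split, one_pattern_map, and run_one_pattern_query are hypothetical. The point of the sketch is that the map function passes its input mapping through unchanged and never contacts the index.

```python
def scan_split(split, pattern):
    """Yield the solution mappings for `pattern` found in the triples of one local split."""
    for triple in split:
        mapping = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # bind the variable, or reject the triple if it conflicts with a binding
                if mapping.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield mapping


def one_pattern_map(in_key, in_value):
    """Map function: emit the input mapping as is, without any further table request."""
    yield None, in_value


def run_one_pattern_query(splits, pattern):
    """Simulate the map phase: one task per split, outputs collected in split order."""
    output = []
    for split in splits:
        for mapping in scan_split(split, pattern):
            output.extend(value for _, value in one_pattern_map(None, mapping))
    return output


splits = [
    [("a", "knows", "b"), ("a", "name", "A")],
    [("c", "knows", "d")],
]
mappings = run_one_pattern_query(splits, ("?x", "knows", "?y"))
```

Because no reduce phase and no lookups are involved, the cost of such a job is dominated by the scan and by job initialization, which is why local execution pays off for small results.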
 