the mappers, MR_Sort (the sort-merge cost), as well as MR_TR (the network transfer cost), and finally R_Write (the cost of writing the reduce output to HDFS). That is, the cost of a MapReduce cycle MR_i can be summarized as

MR_i = M_Read + (M_Write + MR_Sort + MR_TR) + R_Write
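To make the role of per-cycle overhead concrete, the following is a minimal Python sketch (not taken from the text) that sums this cost over a workflow; the names CycleCost and workflow_cost and the fixed per_cycle_overhead parameter are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class CycleCost:
    m_read: float   # M_Read: mappers reading input from HDFS
    m_write: float  # M_Write: writing intermediate map output
    mr_sort: float  # MR_Sort: sort-merge cost
    mr_tr: float    # MR_TR: network (shuffle) transfer cost
    r_write: float  # R_Write: writing reduce output to HDFS

    def total(self, overhead=0.0):
        # MR_i = M_Read + (M_Write + MR_Sort + MR_TR) + R_Write (+ fixed overhead)
        return (self.m_read + self.m_write + self.mr_sort
                + self.mr_tr + self.r_write + overhead)

def workflow_cost(cycles, per_cycle_overhead=0.0):
    # A longer workflow pays the fixed per-cycle overhead more often,
    # which is why decompositions with fewer cycles tend to cost less.
    return sum(c.total(per_cycle_overhead) for c in cycles)

Under such a model, two decompositions that shuffle comparable amounts of data can still differ in cost if one needs more cycles, since each cycle adds its own startup and materialization overhead.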
6.2.1 RDF Data Processing on MapReduce
Most nontrivial data-processing tasks will require multiple MapReduce cycles that
are chained together into sequential, but data dependency-preserving, execution
workflows. There are often multiple possible decompositions of a task into cycles or
subtasks. Therefore, given a data-processing task, a key question is to find a decom-
position that produces a low overall workflow cost. Since, as mentioned earlier, each
MapReduce cycle incurs a significant amount of overhead, decompositions that
result in shorter workflows have a better chance of fulfilling that requirement. In
the context of RDF data processing, the decomposition problem is the distribution
of SELECT, PROJECT, and particularly JOIN operations into subtasks, where each subtask is supported by a MapReduce cycle. The decomposition issue is related to the ordering of operations because only neighboring operations in a query plan can be effectively grouped into the same subtask. In the MapReduce context, it is helpful to guide the ordering of operations by their key partitioning requirements: neighboring operations with conflicting partitioning requirements force a repartitioning step between them, delineating them into different cycles. For example, the two JOIN operations that join the first three triple patterns in our example query Q1 (see Figure 6.1b) can be computed as a multiway join on the Subject column (a star-join), and hence can be grouped and executed in the same MapReduce (MR) cycle. Figure 6.1c shows the relational algebra plan after decomposing query Q1 based on star joins, which required a total of five MR cycles: one for each of the three star joins (SJ1, SJ2, SJ3), and two MR cycles to compute the joins between the stars (J1 and J2). To understand the impact of workflow length on performance, we present a case study that empirically evaluates query execution strategies on MapReduce.
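To illustrate why a subject-keyed star join fits in a single MR cycle, here is a minimal Python sketch that simulates the map, shuffle, and reduce phases outside Hadoop; the triple layout and the predicate names (label, price, vendor) are hypothetical and not drawn from query Q1 or the BSBM data.

from collections import defaultdict
from itertools import product

def map_phase(triples, star_predicates):
    # Map: emit each matching triple keyed by its Subject, so all
    # candidate bindings for one subject meet at the same reducer.
    for s, p, o in triples:
        if p in star_predicates:
            yield s, (p, o)

def shuffle(pairs):
    # Shuffle/sort: group values by key (done by the framework in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, star_predicates):
    # Reduce: keep subjects that bind every pattern of the star and
    # emit one result row per combination of matching objects.
    for s, values in groups.items():
        by_pred = defaultdict(list)
        for p, o in values:
            by_pred[p].append(o)
        if all(p in by_pred for p in star_predicates):
            for combo in product(*(by_pred[p] for p in star_predicates)):
                yield (s, *combo)

if __name__ == "__main__":
    triples = [
        ("prod1", "label", "Widget"),
        ("prod1", "price", 10),
        ("prod1", "vendor", "v1"),
        ("prod2", "label", "Gadget"),
        ("prod2", "price", 20),   # no vendor triple: dropped by the star join
    ]
    star = ("label", "price", "vendor")
    rows = reduce_phase(shuffle(map_phase(triples, star)), star)
    print(list(rows))   # [('prod1', 'Widget', 10, 'v1')]

Because every triple pattern in the star shares the Subject as join key, one partitioning step satisfies all of them at once; a subsequent join between two stars needs a different key, which is what pushes it into a separate cycle (J1 and J2 in Figure 6.1c).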
6.2.2 Case Study: Different Groupings of Star-Joins
Figure 6.2 shows a case study with 6 test queries (each with two star subpatterns)
using the BSBM synthetic benchmark data set (43 GB) on a 10-node Hadoop cluster.
The test queries have varying join structures with object-subject join (Q1a, Q1b,
Q2a, Q2b) and object-object join (Q3a, Q3b) between star patterns. Queries Q1b,
Q2b, Q3b are variations of Q1a, Q2a, Q3a, respectively, where one of the two star
joins is highly selective due to an additional filter on the object column. We evalu-
ated three approaches: (i) a star-join-per-cycle approach (SJ-per-cycle), (ii) the most selective grouping of joins first, preserving star structure as much as possible to minimize MR cycles (Sel-SJ-first), and (iii) a grouping-based approach that concurrently computes all star joins in a single cycle (NTGA). The SJ-per-cycle approach requires three MR cycles for all queries (two of the three cycles require a full scan of