the mappers, MR_Sort (the sort-merge cost), as well as MR_TR (the network transfer cost), and finally R_Write (the cost of writing the reduce output to HDFS). That is, the cost of a MapReduce cycle MR_i can be summarized as

MR_i = M_Read + (M_Write + MR_Sort + MR_TR) + R_Write
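To make the role of per-cycle overhead concrete, the following is a minimal Python sketch (not taken from the text) that sums this cost over a workflow; the names CycleCost and workflow_cost and the fixed per_cycle_overhead parameter are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class CycleCost:
    m_read: float   # M_Read: mappers reading input from HDFS
    m_write: float  # M_Write: writing intermediate map output
    mr_sort: float  # MR_Sort: sort-merge cost
    mr_tr: float    # MR_TR: network (shuffle) transfer cost
    r_write: float  # R_Write: writing reduce output to HDFS

    def total(self, overhead=0.0):
        # MR_i = M_Read + (M_Write + MR_Sort + MR_TR) + R_Write (+ fixed overhead)
        return (self.m_read + self.m_write + self.mr_sort
                + self.mr_tr + self.r_write + overhead)

def workflow_cost(cycles, per_cycle_overhead=0.0):
    # A longer workflow pays the fixed per-cycle overhead more often,
    # which is why decompositions with fewer cycles tend to cost less.
    return sum(c.total(per_cycle_overhead) for c in cycles)

Under such a model, two decompositions that shuffle comparable amounts of data can still differ in cost if one needs more cycles, since each cycle adds its own startup and materialization overhead.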
6.2.1 RDF Data Processing on MapReduce
Most nontrivial data-processing tasks will require multiple MapReduce cycles that
are chained together into sequential, but data dependency-preserving, execution
workflows. There are often multiple possible decompositions of a task into cycles or
subtasks. Therefore, given a data-processing task, a key question is to find a decom-
position that produces a low overall workflow cost. Since, as mentioned earlier, each
MapReduce cycle incurs a significant amount of overhead, decompositions that
result in shorter workflows have a better chance of fulfilling that requirement. In
the context of RDF data processing, the decomposition problem is the distribution
of SELECT, PROJECT, and particularly JOIN operations into subtasks, where each subtask is supported by a MapReduce cycle. The decomposition issue is related to the ordering of operations because only neighboring operations in a query plan can be effectively grouped into the same subtask. In the MapReduce context, it is helpful to guide the ordering of operations by their key partitioning requirements: neighboring operations with conflicting partitioning requirements force a repartitioning step between them, delineating them into different cycles. For example, the two JOIN operations that join the first three triple patterns in our example query Q1 (see Figure 6.1b) can be computed as a multiway join on the Subject column (a star-join), and hence can be grouped and executed in the same MapReduce (MR) cycle. Figure 6.1c shows the relational algebra plan after decomposing query Q1 based on star joins, which required a total of five MR cycles: one for each of the three star joins (SJ1, SJ2, SJ3), and two MR cycles to compute the joins between the stars (J1 and J2). To understand the impact of workflow length on performance, we present a case study that empirically evaluates query execution strategies on MapReduce.
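To illustrate why a subject-keyed star join fits in a single MR cycle, here is a minimal Python sketch that simulates the map, shuffle, and reduce phases outside Hadoop; the triple layout and the predicate names (label, price, vendor) are hypothetical and not drawn from query Q1 or the BSBM data.

from collections import defaultdict
from itertools import product

def map_phase(triples, star_predicates):
    # Map: emit each matching triple keyed by its Subject, so all
    # candidate bindings for one subject meet at the same reducer.
    for s, p, o in triples:
        if p in star_predicates:
            yield s, (p, o)

def shuffle(pairs):
    # Shuffle/sort: group values by key (done by the framework in Hadoop).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, star_predicates):
    # Reduce: keep subjects that bind every pattern of the star and
    # emit one result row per combination of matching objects.
    for s, values in groups.items():
        by_pred = defaultdict(list)
        for p, o in values:
            by_pred[p].append(o)
        if all(p in by_pred for p in star_predicates):
            for combo in product(*(by_pred[p] for p in star_predicates)):
                yield (s, *combo)

if __name__ == "__main__":
    triples = [
        ("prod1", "label", "Widget"),
        ("prod1", "price", 10),
        ("prod1", "vendor", "v1"),
        ("prod2", "label", "Gadget"),
        ("prod2", "price", 20),   # no vendor triple: dropped by the star join
    ]
    star = ("label", "price", "vendor")
    rows = reduce_phase(shuffle(map_phase(triples, star)), star)
    print(list(rows))   # [('prod1', 'Widget', 10, 'v1')]

Because every triple pattern in the star shares the Subject as join key, one partitioning step satisfies all of them at once; a subsequent join between two stars needs a different key, which is what pushes it into a separate cycle (J1 and J2 in Figure 6.1c).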
6.2.2 Case Study: Different Groupings of Star-Joins
Figure 6.2 shows a case study with 6 test queries (each with two star subpatterns)
using the BSBM synthetic benchmark data set (43 GB) on a 10-node Hadoop cluster.
The test queries have varying join structures with object-subject join (Q1a, Q1b,
Q2a, Q2b) and object-object join (Q3a, Q3b) between star patterns. Queries Q1b,
Q2b, Q3b are variations of Q1a, Q2a, Q3a, respectively, where one of the two star
joins is highly selective due to an additional filter on the object column. We evalu-
ated three approaches: (i) a star-join-per-cycle approach (SJ-per-cycle), (ii) the most selective grouping of joins first, preserving star structure as much as possible to minimize MR cycles (Sel-SJ-first), and (iii) a grouping-based approach that concurrently computes all star joins in a single cycle (NTGA). The SJ-per-cycle approach requires three MR cycles for all queries (two of the three cycles require a full scan of