[Bar chart: SJ-per-cycle, Sel-SJ-first, and NTGA compared on queries Q1a, Q1b, Q2a, Q2b, Q3a, and Q3b; vertical axis ticks from 0 to 2500 (axis label not recoverable); dataset BSBM-500K: 43 GB, 10-node]
FIGURE 6.2
A comparative evaluation of different groupings of star-joins.
triple relation). For object-subject joins, the Sel-SJ-first approach can group the joins into
just two MR cycles (both cycles scan the entire triple relation). For the object-object join
(Q3a, Q3b), Sel-SJ-first still requires three MR cycles and, more importantly, incurs very
high HDFS reads because all three cycles perform a full scan of the triple relation. In contrast,
the NTGA approach minimizes both the number of MR cycles (two cycles for all
queries) and the required number of full scans of the triple relation,
thus outperforming the other two approaches on all the test queries.
Besides the length of the workflow execution, the sizes of intermediate outputs
and inputs also affect performance, because $M_{Read}$, $M_{Write}$, $MR_{Sort}$,
$MR_{TR}$, and $R_{Write}$ are all functions of the size of the data. In addition to its impact on
disk I/O and network traffic, which affect query latency,
the size of intermediate results also determines the disk space a MapReduce
workflow requires. This is because systems such as Hadoop provide fault tolerance by
storing intermediate results until the workflow completes. Therefore, to successfully
complete the execution of a workflow with k MR cycles, $MR_1$ to $MR_k$, the amount of
available disk space should be at least equal to
\[
\left( Inp + Out_{MR_1} + Out_{MR_2} + \cdots + Out_{MR_k} \right) \times Rep_{DFS}
\]
where $Inp$ is the initial input, $Out_{MR_i}$ is the reduce output of the $i$th MR cycle, and $Rep_{DFS}$
is the configured replication factor of the distributed file system.
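
As a rough illustration of the disk-space bound above, the following Python sketch sums the initial input and the reduce output of each MR cycle, then multiplies by the replication factor. The function name `min_disk_space` and the sample sizes are assumptions made for illustration. Only the 43 GB input size is taken from the BSBM-500K dataset label; the per-cycle output sizes and the replication factor of 3 are hypothetical, not measured values.

```python
from typing import Sequence


def min_disk_space(inp: float, reduce_outputs: Sequence[float], rep_dfs: int) -> float:
    """Lower bound on disk space for a workflow of k MR cycles.

    inp            -- size of the initial input (Inp)
    reduce_outputs -- reduce output size of each cycle (Out_MR_1 .. Out_MR_k)
    rep_dfs        -- replication factor of the distributed file system (Rep_DFS)
    All sizes must use the same unit; the result is in that unit.
    """
    if rep_dfs < 1:
        raise ValueError("replication factor must be at least 1")
    if inp < 0 or any(out < 0 for out in reduce_outputs):
        raise ValueError("sizes must be non-negative")
    # (Inp + Out_MR_1 + ... + Out_MR_k) * Rep_DFS
    return (inp + sum(reduce_outputs)) * rep_dfs


# Hypothetical two-cycle workflow: 43 GB input, 12 GB and 3 GB reduce outputs,
# replication factor 3 (illustrative values only).
required_gb = min_disk_space(43.0, [12.0, 3.0], 3)
print(required_gb)  # 174.0
```

Under these assumed sizes, shrinking the first cycle's output lowers the bound by three times the saving, which is why the footprint of intermediate results matters alongside the number of cycles.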
Remark 6.2
The above discussion highlights two key objectives for optimizing the evaluation of
queries on MapReduce platforms: minimizing the length of the execution workflow and
minimizing the footprint of intermediate results.
6.3 RELATED WORK
Over the last decade, there has been significant research in developing efficient and
scalable RDF processing systems. State-of-the-art single-node systems [10,30,46] have