Algebraic Optimization of RDF Graph Pattern Queries on MapReduce - Large Scale and Big Data: Processing and Management - page 189


FIGURE 6.2

A comparative evaluation of different groupings of star-joins. [Bar chart comparing SJ-per-cycle, Sel-SJ-first, and NTGA on queries Q1a, Q1b, Q2a, Q2b, Q3a, and Q3b; vertical axis from 0 to 2500; BSBM-500K: 43 GB, 10-node.]

triple relation). For object-subject joins, the Sel-SJ-first approach can group joins into just two MR cycles (both cycles scan the entire triple relation). For the object-object join (Q3a, Q3b), Sel-SJ-first still requires three MR cycles but, more importantly, incurs very high HDFS reads because all three cycles perform a full scan of the triple relation. In contrast, the NTGA approach minimizes both the number of MR cycles (two cycles for all queries) and the number of full scans of the triple relation, thus outperforming the other two approaches on all the test queries.

Besides the issue of workflow execution length, the sizes of intermediate outputs and inputs have an impact on performance. This is because $M_{Read}$, $M_{Write}$, $MR_{Sort}$, $MR_{TR}$, and $R_{Write}$ are all functions of the size of the data. In addition to the impact of the intermediate data size on disk I/Os and network traffic, which affect query latency, the size of intermediate results also impacts the disk space requirements for a MapReduce workflow. This is because systems such as Hadoop provide fault-tolerance by storing intermediate results until the workflow completes. Therefore, to successfully complete the execution of a workflow with k MR cycles $MR_1$ to $MR_k$, the amount of available disk space should be at least equal to

$$\left(\mathit{Inp} + \mathit{Out}_{MR_1} + \mathit{Out}_{MR_2} + \cdots + \mathit{Out}_{MR_k}\right) \times \mathit{Rep}_{DFS}$$

where $\mathit{Inp}$ is the initial input, $\mathit{Out}_{MR_i}$ is the reduce output for the ith MR cycle, and $\mathit{Rep}_{DFS}$ is the configured replication factor of the distributed file system.
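The disk space bound above can be sketched as a small calculation. In this minimal sketch, the function name `min_disk_space` is hypothetical, the 43 GB input is the BSBM-500K size quoted with Figure 6.2, and the reduce output sizes and the replication factor of 3 are illustrative assumptions rather than measured values.

```python
def min_disk_space(inp, reduce_outputs, rep_dfs):
    """Lower bound on disk space for a k-cycle MapReduce workflow.

    Computes (Inp + Out_MR_1 + ... + Out_MR_k) * Rep_DFS, where
    reduce_outputs lists the reduce output size of each MR cycle.
    All sizes must use the same unit (GB here).
    """
    if rep_dfs < 1:
        raise ValueError("replication factor must be at least 1")
    return (inp + sum(reduce_outputs)) * rep_dfs


# Assumed example values: 43 GB input, two MR cycles whose reduce
# outputs are 12 GB and 5 GB (hypothetical), replication factor 3.
required_gb = min_disk_space(43, [12, 5], 3)
print(f"At least {required_gb} GB of disk space is needed")
```

Because every term is multiplied by the replication factor, removing an MR cycle or shrinking a single intermediate output lowers the bound by that output's size times the replication factor.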

Remark 6.2

The above discussion highlights two key objectives for optimizing the evaluation of queries on MapReduce platforms: minimizing the length of the execution workflow and minimizing the footprint of intermediate results.

6.3 RELATED WORK

Over the last decade, there has been significant research in developing efficient and scalable RDF processing systems. State-of-the-art single-node systems [10,30,46] have
