Algebraic Optimization of RDF Graph Pattern Queries on MapReduce - Large Scale and Big Data: Processing and Management


respectively. The experiments were conducted on a 10-node cluster with the D1 data set
(51 GB). The results show that RAPID+ achieves a performance gain of 60% over the
default Pig implementation. This is due to the shorter MapReduce execution workflow:
for example, Q4 takes 4 MR cycles in RAPID+ vs. 7 MR cycles in Pig.

6.7.3 Scalability with Increasing Cluster Size

Figure 6.11b demonstrates the scalability of the NTGA-based approach in RAPID+
against the relational-style approach in Pig as the cluster size varies. The evaluation
used query Q5, which has 7 triple patterns, on the D2 data set (43 GB).
RAPID+ shows a performance gain of 31% over the Pig approach on the 10-node cluster,
which increases to 41% as the cluster size grows to 30 nodes. The larger cluster
enables more parallelization of the grouping-based star-join computation
in RAPID+, further reducing the overall execution time.

6.8 INTRAQUERY SCAN SHARING FOR NTGA EXECUTION PLANS

In this section, we consider the problem of sharing scans within a query. This problem
arises when a query contains repeated occurrences of a property participating in
different join operations.

Example 6.4: Graph Pattern Query with Repeated Properties

Common examples are multiple uses of RDF schema properties such as
rdf:type and rdfs:label in a single graph pattern. For example, the query in Figure
6.12a contains a repeated property, label, across the two star patterns SJ1 and SJ2.
Such graph patterns are common because the RDF model allows liberal
use of properties for describing resources to reflect their different contexts,
and resources in heterogeneous collections may have a variety of properties
describing them. Relational-style processing of such graph patterns
scans the property relation once for each join operation, and the repeated scans
increase the overall I/O overhead of such workflows. Figure 6.12c shows the
MR workflow, consisting of 3 MR jobs, produced by the relational-style approach. In this
workflow, jobs 1 and 2 each scan the input relation, so the relation is read twice
to select the same triples with the properties (type, label, date).
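The cost difference described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the RAPID+ or Pig implementation: the triples, the property sets assigned to SJ1 and SJ2, and the names CountingRelation, star_join, and shared_scan_star_joins are all hypothetical, since Figure 6.12 is not reproduced here. The sketch only counts how often the input relation is read when each star-join runs as its own job, compared with a single pass that groups triples by subject and matches both star patterns against the groups.

```python
from collections import defaultdict

# Toy triple relation (subject, property, object); all values are invented.
TRIPLES = [
    ("p1", "type", "Product"),
    ("p1", "label", "Widget"),
    ("p1", "date", "2012-01-01"),
    ("v1", "type", "Vendor"),
    ("v1", "label", "Acme"),
    ("v1", "country", "US"),
]

# Hypothetical star patterns; "type" and "label" are repeated across both.
STARS = {
    "SJ1": {"type", "label", "date"},
    "SJ2": {"type", "label", "country"},
}


class CountingRelation:
    """Wraps the triple list and counts how many full scans are made."""

    def __init__(self, triples):
        self._triples = triples
        self.scans = 0

    def scan(self):
        self.scans += 1
        return iter(self._triples)


def star_join(relation, props):
    """One relational-style job: scan the relation, keep triples with a
    wanted property, and return subjects that have every property in props.
    Simplification: one value per (subject, property) pair."""
    by_subject = defaultdict(dict)
    for s, p, o in relation.scan():
        if p in props:
            by_subject[s][p] = o
    return {s: po for s, po in by_subject.items() if set(po) == set(props)}


def shared_scan_star_joins(relation, stars):
    """Single scan: group the relevant triples by subject once, then match
    every star pattern against the resulting groups."""
    wanted = set().union(*stars.values())
    groups = defaultdict(dict)
    for s, p, o in relation.scan():
        if p in wanted:
            groups[s][p] = o
    return {
        name: {
            s: {p: po[p] for p in props}
            for s, po in groups.items()
            if props <= set(po)
        }
        for name, props in stars.items()
    }


# Relational style: one scan per star-join job.
rel = CountingRelation(TRIPLES)
per_job = {name: star_join(rel, props) for name, props in STARS.items()}

# Shared scan: both star patterns are answered from one pass.
rel_shared = CountingRelation(TRIPLES)
shared = shared_scan_star_joins(rel_shared, STARS)

print(rel.scans, rel_shared.scans)
```

In this toy setting both strategies return the same matches, but the per-job version reads the relation once per star pattern while the grouped version reads it once in total. On a 43 GB or 51 GB data set, that difference is what shows up as the extra I/O described above.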

6.8.1 Scan-Sharing Strategies for Efficient Processing of Repeated Properties

To avoid such multiple scans of a relation, we may either buffer the relation for as
long as it is needed (if memory is available) or use DAG (directed acyclic
graph)-shaped plans, so that the output of an operator can be sent to more than one
operator. This requires either interoperator or pipelined parallelism, which allows
operators to execute concurrently. However, neither of these scenarios
is possible in the MapReduce model. Hence, there is a need for approaches that
