respectively. The experiments were conducted on a 10-node cluster with the D1 data set
(51 GB). RAPID+ shows a performance gain of 60% over the default Pig implementation.
This is due to the shorter MapReduce execution workflow, for example, Q4 takes
4 MR cycles with RAPID+ vs. 7 MR cycles with Pig.
6.7.3 Scalability with Increasing Cluster Size
Figure 6.11b demonstrates the scalability of the NTGA-based approach in RAPID+
against the relational-style approach in Pig as the cluster size varies. The evaluation
used query Q5, which has 7 triple patterns, on the D2 data set (43 GB).
RAPID+ shows a performance gain of 31% over the Pig approach on the 10-node cluster,
which increases to 41% as the cluster size grows to 30 nodes. The larger cluster
enables more parallelization of the grouping-based star-join computation
in RAPID+, further reducing the overall execution time.
6.8 INTRAQUERY SCAN SHARING FOR NTGA EXECUTION PLANS
In this section, we consider the problem of sharing scans within a query. This
problem arises when a query contains repeated occurrences of a property participating
in different join operations.
Example 6.4: Graph Pattern Query with Repeated Properties
Common examples are the multiple uses of RDF schema properties such as
rdf:type and rdfs:label in a single graph pattern. For example, the query in Figure
6.12a contains a repeated property, label, across the two star patterns SJ1 and SJ2.
Such graph patterns are common because the RDF model allows liberal
use of properties for describing resources to reflect different contexts of resources,
and resources in heterogeneous collections may have a variety of properties
describing them. Relational-style processing of such graph patterns results in
scanning the property relation once for each operation, leading to multiple scans
that increase the overall I/O overhead of such workflows. Figure 6.12c shows the
MR workflow, consisting of 3 MR jobs, based on the relational-style approach. In this
workflow, jobs 1 and 2 each scan the input relation to select the same triples, those
whose properties are (type, label, date), so the relation is read twice.
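The I/O cost described in this example can be illustrated with a minimal in-memory
sketch. It is not MapReduce code and not the scan-sharing strategy of RAPID+; the
triples, the property sets, and the names CountingRelation, select_per_operation, and
select_shared are all hypothetical and serve only to contrast one scan per operation
with a single shared scan.

```python
# Hypothetical triple relation (subject, property, object); not data from the text.
TRIPLES = [
    ("p1", "type", "Product"),
    ("p1", "label", "Widget"),
    ("p1", "date", "2010-01-01"),
    ("o1", "type", "Offer"),
    ("o1", "label", "Sale"),
    ("o1", "product", "p1"),
]


class CountingRelation:
    """Wraps a list of triples and counts how many full scans are performed."""

    def __init__(self, triples):
        self._triples = triples
        self.scans = 0

    def scan(self):
        # Each call that is iterated corresponds to one full read of the relation.
        self.scans += 1
        yield from self._triples


def select_per_operation(relation, property_sets):
    """Relational-style selection: one scan of the relation per operation."""
    return [
        [t for t in relation.scan() if t[1] in props]
        for props in property_sets
    ]


def select_shared(relation, property_sets):
    """Shared scan: read the relation once and route each triple to every
    operation whose property set contains the triple's property."""
    outputs = [[] for _ in property_sets]
    for t in relation.scan():
        for i, props in enumerate(property_sets):
            if t[1] in props:
                outputs[i].append(t)
    return outputs


# Two hypothetical star patterns that both use 'type' and 'label'.
PROPERTY_SETS = [{"type", "label", "date"}, {"type", "label", "product"}]

per_op_relation = CountingRelation(TRIPLES)
per_op_result = select_per_operation(per_op_relation, PROPERTY_SETS)

shared_relation = CountingRelation(TRIPLES)
shared_result = select_shared(shared_relation, PROPERTY_SETS)

print(per_op_relation.scans, shared_relation.scans)
```

Both functions return the same selections; only the number of passes over the input
differs. In this sketch the shared version is possible because one loop feeds several
consumers at once.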
6.8.1 Scan-Sharing Strategies for Efficient Processing of Repeated Properties
To avoid such multiple scans on a relation, we may either buffer the relation for as
long as it is needed (if memory is available) or use DAG (directed acyclic
graph)-shaped plans so that the output of an operator can be sent to more than one
operator. This requires either interoperator or pipelined parallelism, which allows
concurrent execution of operators. However, neither of these scenarios
is possible in the MapReduce model. Hence, there is a need for approaches that