Algebraic Optimization of RDF Graph Pattern Queries on MapReduce - Large Scale and Big Data: Processing and Management

6.9.2 Case Study: Impact of Nesting and Lazy Unnesting Strategies for Graph Pattern Queries with Multivalued Properties

This section presents a study on the impact of the proposed nesting and unnesting strategies on minimizing the redundancy factor in intermediate results while processing graph pattern queries. The comparative evaluation included two popular relational-style systems, Apache Pig (Pig-Opt, with COGROUP-based star-join computation) and Hive (Hive), both of which support a tuple-based algebra. NTGA-Opt denotes NTGA with the lazy partial unnesting strategy.

Setup and Testbed. Experiments were conducted on a 10-node Hadoop cluster with Pig release 0.10.0, Hive 0.8.1, and Hadoop 0.20.2. The BSBM [11] synthetic benchmark data set was used for evaluation; it contains two multivalued properties, productFeature with approximate multiplicity 19 and product type with multiplicity 6. The results presented in this section are for the BSBM-500K data set with 500,000 products (43 GB in size). Two categories of queries that involve multivalued properties were considered: (i) non-MV join, where the join variable is single-valued, and (ii) MVJoin, where the join variable is the object of a multivalued property.
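
The distinction between the two query categories can be sketched in a few lines of Python. This is only an illustration: the triple-pattern encoding, the function name is_mv_join, and the exact spelling of the property identifiers are assumptions, not part of the evaluated systems.

```python
# Properties treated as multivalued; the identifier spelling is assumed here.
MULTIVALUED = {"productFeature", "productType"}


def is_mv_join(patterns, join_var):
    """Return True if join_var is the object of a multivalued-property pattern.

    patterns is a list of (subject, property, object) triple patterns,
    with variables written as strings starting with '?'.
    """
    return any(o == join_var and p in MULTIVALUED for (_s, p, o) in patterns)


# Join on ?product, which is never the object of a multivalued property.
non_mv_query = [
    ("?product", "label", "?label"),
    ("?product", "productFeature", "?feature"),
    ("?offer", "product", "?product"),
]

# Join on ?feature, the object of the multivalued property productFeature.
mv_query = [
    ("?p1", "productFeature", "?feature"),
    ("?p2", "productFeature", "?feature"),
]
```

Under these assumptions, is_mv_join(non_mv_query, "?product") is False and is_mv_join(mv_query, "?feature") is True.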

Impact of the Nesting Strategy. Figure 6.20a shows the performance evaluation of the approaches for queries containing one multivalued property with low (product type, multiplicity 6) and high (product feature, multiplicity 19) multiplicity, respectively. Figure 6.20b shows the redundancy factor in intermediate results when the queries are evaluated using a tuple-based algebra in Hive. Queries low-1Star and high-1Star (both with one star subpattern) can be computed in a single MR cycle (MR_S1), and their reduce outputs have redundancy factors of 0.72 and 0.82, respectively, when evaluated using Pig/Hive. The two-star subpattern queries (low-2Star and high-2Star) demonstrate how the redundancy factor compounds across the subsequent join cycle: the redundancy factor of low-2Star increases from 0.72 (in MR_S1) to 0.78 after the subsequent join in MR_S1⋈S2, while that of high-2Star increases from 0.82 (in MR_S1) to 0.89 (in MR_S1⋈S2).
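
To illustrate why higher multiplicity leads to more redundancy in a tuple-based star-join result, the following sketch counts stored values for a single subject in a flat versus a nested representation. It assumes a simplified redundancy measure (the share of flat values that merely repeat values already stored), which may differ from the definition used for Figure 6.20b. The function names and the choice of three single-valued properties are hypothetical, and the sketch is not expected to reproduce the reported factors.

```python
def flat_cells(single_valued, multiplicities):
    """Values stored in the flat n-tuple result of a star join for one subject.

    Each combination of multivalued-property values produces one tuple, and
    every tuple repeats the subject and all single-valued property values.
    """
    rows = 1
    for m in multiplicities:
        rows *= m
    width = 1 + single_valued + len(multiplicities)  # subject + one column per property
    return rows * width


def nested_cells(single_valued, multiplicities):
    """Values stored when each distinct value is kept once (nested form)."""
    return 1 + single_valued + sum(multiplicities)


def redundancy(single_valued, multiplicities):
    """Assumed measure: fraction of flat values that are repeats."""
    flat = flat_cells(single_valued, multiplicities)
    return (flat - nested_cells(single_valued, multiplicities)) / flat


low = redundancy(3, [6])    # one multivalued property with multiplicity 6
high = redundancy(3, [19])  # one multivalued property with multiplicity 19
```

With three single-valued properties, the flat form stores 30 values against 10 nested for multiplicity 6, and 95 against 23 for multiplicity 19, so the assumed measure grows with multiplicity, in line with the trend reported for low-1Star and high-1Star.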

The impact of the redundancy factor on HDFS writes can be seen in Figure 6.20a. Both Hive/Pig-Opt approaches failed to complete execution for high-2Star on a 10-node cluster due to insufficient disk space (denoted as a missing bar for Hive). This failure can be attributed to the blow-up of the intermediate results. The Hive approach occupied 52% more disk space after the star-join phase than the nested approaches. In contrast, the nested approaches (Pig-Opt and NTGA) required 71.5% / 86.6% less disk space overall than Hive for queries low-1Star / high-2Star, respectively.

Impact of the Lazy Unnesting Strategy. This evaluation included four MVJoin queries with varying density of the star subpattern containing the multivalued property: MV-2p to MV-5p, whose Star-MVP consists of 2 to 5 triple patterns, respectively. Denser star-join structures result in larger non-MV components and hence a higher redundancy factor. The lazy unnesting strategies in NTGA outperform both Hive and the early complete unnesting in Pig-Opt for all queries. As the size of the redundant component increases, NTGA shows an increasing performance gain over Pig-Opt, from 61% in MV-2p to 68% in MV-5p.
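
The difference between complete and partial unnesting can be sketched as follows. This is a simplified model under stated assumptions, not the NTGA or Pig implementation: a triplegroup is represented as a dictionary, the helper names (unnest_complete, unnest_partial, cells) are hypothetical, and "size" is measured as a count of stored values rather than bytes.

```python
from itertools import product


def unnest_complete(tg):
    """Flatten every property: one flat record per combination of values."""
    props = sorted(tg["props"])
    for combo in product(*(tg["props"][p] for p in props)):
        yield {"subject": tg["subject"], **dict(zip(props, combo))}


def unnest_partial(tg, join_prop):
    """Unnest only join_prop; all other properties stay nested as lists."""
    rest = {p: v for p, v in tg["props"].items() if p != join_prop}
    for value in tg["props"][join_prop]:
        yield {"subject": tg["subject"], "join_key": value, "props": rest}


def cells(record):
    """Count stored values in a record, descending into nested dicts and lists."""
    total = 0
    for v in record.values():
        if isinstance(v, dict):
            total += cells(v)
        elif isinstance(v, list):
            total += len(v)
        else:
            total += 1
    return total


# Hypothetical triplegroup: two single-valued and two multivalued properties.
tg = {
    "subject": "product1",
    "props": {
        "label": ["l1"],
        "producer": ["pr1"],
        "productFeature": [f"f{i}" for i in range(19)],
        "productType": [f"t{i}" for i in range(6)],
    },
}

complete = list(unnest_complete(tg))
partial = list(unnest_partial(tg, "productFeature"))
complete_size = sum(cells(r) for r in complete)
partial_size = sum(cells(r) for r in partial)
```

For this example, complete unnesting yields 114 records holding 570 values, while unnesting only the join property yields 19 records holding 190 values. Adding more properties to the star increases the portion repeated in every flat record, which is consistent with the observation that denser star subpatterns widen the gap between the two strategies.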
