6.9.2 Case Study: Impact of Nesting and Lazy Unnesting Strategies
for Graph Pattern Queries with Multivalued Properties
This section presents a study of the impact of the proposed nesting and unnesting
strategies on minimizing the redundancy factor in intermediate results while
processing graph pattern queries. The comparative evaluation included two popular
relational-style systems, Apache Pig (Pig-Opt, with COGROUP-based star-join
computation) and Hive (Hive), both of which support a tuple-based algebra. NTGA-Opt
denotes NTGA with the lazy partial unnesting strategy.
Setup and Testbed. Experiments were conducted on a 10-node Hadoop cluster
with Pig release 0.10.0, Hive 0.8.1, and Hadoop 0.20.2. The BSBM [11] synthetic
benchmark data set was used for evaluation; it includes two multivalued
properties: productFeature, with approximate multiplicity 19, and product type,
with multiplicity 6. The results presented in this section are for the BSBM-500K
data set with 500,000 products (43 GB in size). Two categories of queries that
involve multivalued properties were considered: (i) non-MV join, where the join
variable is single-valued, and (ii) MVJoin, where the join variable is the object
of a multivalued property.
Impact of the Nesting Strategy. Figure 6.20a shows the performance evaluation
of the approaches for queries containing one multivalued property with low
(product type with 6) and high (product feature with 19) multiplicity, respectively.
Figure 6.20b shows the redundancy factor in intermediate results when evaluating
the queries using a tuple-based algebra in Hive. Queries low-1Star and high-1Star
(both with one star subpattern) can be computed in a single MR cycle (MR_S1), and
their reduce output has a redundancy factor of 0.72 and 0.82, respectively, when
evaluated using Pig/Hive. The queries with two star subpatterns (low-2Star and
high-2Star) demonstrate how the redundancy factor compounds across the subsequent
join cycle. The redundancy factor of low-2Star increases from 0.72 (in MR_S1)
to 0.78 after the subsequent join in MR_S1⋈S2, and that of high-2Star increases
from 0.82 (in MR_S1) to 0.89 (in MR_S1⋈S2).
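The effect of nesting on redundancy can be sketched in a few lines of Python. This section does not define the redundancy factor, so the sketch assumes a simple measure: the fraction of atomic values in the flat (fully unnested) star-join result that the nested representation avoids storing. The function names, the sample product, and its property values are hypothetical, and the numbers it produces are not expected to match the measured values reported above.

```python
from itertools import product


def flat_star_join(single_valued, multi_valued):
    """Flat tuples: one row per combination of multivalued objects."""
    names = list(multi_valued)
    rows = []
    for combo in product(*(multi_valued[n] for n in names)):
        row = dict(single_valued)  # non-MV component is copied into every row
        row.update(zip(names, combo))
        rows.append(row)
    return rows


def nested_star_join(single_valued, multi_valued):
    """Nested tuple: the non-MV component is stored once."""
    return {**single_valued, **{n: list(v) for n, v in multi_valued.items()}}


def count_values(record):
    """Number of atomic values in a record (each list element counts once)."""
    return sum(len(v) if isinstance(v, list) else 1 for v in record.values())


def redundancy_factor(single_valued, multi_valued):
    """ASSUMED measure: 1 - (nested size / flat size), in atomic values."""
    flat = flat_star_join(single_valued, multi_valued)
    nested = nested_star_join(single_valued, multi_valued)
    flat_size = sum(count_values(r) for r in flat)
    return 1 - count_values(nested) / flat_size


# Hypothetical product with two single-valued properties.
product_sv = {"subject": "product1", "label": "p1", "producer": "producer7"}

# One multivalued property with low (6) and high (19) multiplicity.
low = redundancy_factor(product_sv, {"type": [f"type{i}" for i in range(6)]})
high = redundancy_factor(product_sv, {"feature": [f"feature{i}" for i in range(19)]})
```

Under this assumed measure, the higher-multiplicity property yields the larger factor, because the non-MV component is repeated once per multivalued object in the flat result.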
The impact of the redundancy factor on HDFS writes can be seen in Figure
6.20a. Both Hive/Pig-Opt approaches failed to complete execution for high-2Star
on a 10-node cluster due to insufficient disk space (denoted as a missing bar for
Hive). This failure can be attributed to the blow-up of the intermediate results. The
Hive approach occupied 52% more disk space after the star-join phase than the
nested approaches. In contrast, the nested approaches (Pig-Opt and NTGA) required
71.5% and 86.6% less disk space overall than Hive for queries low-1Star and
high-2Star, respectively.
Impact of the Lazy Unnesting Strategy. This evaluation included four MVJoin
queries with varying density of the star subpattern containing the multivalued
property: MV-2p to MV-5p, whose Star-MVP consists of 2 to 5 triple patterns,
respectively. Denser star-join structures result in larger non-MV components and
hence a higher redundancy factor. The lazy unnesting strategies in NTGA outperform
both Hive and the early complete unnesting in Pig-Opt for all queries. As the size
of the redundant component increases, NTGA shows an increasing performance gain
over Pig-Opt, from 61% in MV-2p to 68% in MV-5p.
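The difference between early complete unnesting and partial unnesting can be illustrated with a simplified Python sketch. It is not the NTGA or Pig implementation: the function names and the sample nested tuple are hypothetical, and the sketch only contrasts how many tuples each strategy produces before an MVJoin on one multivalued property, and what each tuple carries.

```python
from itertools import product


def complete_unnest(nested, mv_props):
    """Early complete unnesting: flatten every multivalued property."""
    scalars = {k: v for k, v in nested.items() if k not in mv_props}
    rows = []
    for combo in product(*(nested[p] for p in mv_props)):
        rows.append({**scalars, **dict(zip(mv_props, combo))})
    return rows


def partial_unnest(nested, join_prop):
    """Partial unnesting: flatten only the property used as the join key;
    every other multivalued property stays nested as a list."""
    rest = {k: v for k, v in nested.items() if k != join_prop}
    return [{**rest, join_prop: value} for value in nested[join_prop]]


# Hypothetical nested star-join result for one product.
nested = {
    "subject": "product1",
    "label": "p1",
    "feature": [f"feature{i}" for i in range(19)],
    "type": [f"type{i}" for i in range(6)],
}

early = complete_unnest(nested, ["feature", "type"])  # 19 * 6 rows
lazy = partial_unnest(nested, "feature")              # 19 rows, "type" still nested
```

With these sample multiplicities, complete unnesting produces 114 flat tuples while partial unnesting on the join property produces 19, each still holding the six type values as one nested list. Deferring the unnesting step until the join actually needs it keeps the larger form out of earlier intermediate results.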