Database Reference
In-Depth Information
operators are assigned to some map phase, a POJoinPackage operator is always
assigned to a reduce phase since its input includes tuples with the same key from all
mappers. Once the compilation process is finished, Pig initiates the launch of the
first MR job for execution on Hadoop.
6.5
AN ALTERNATIVE ALGEBRA FOR EVALUATING
GRAPH PATTERN QUERIES ON MapReduce
6.5.1 t he C ase For a “g rouPs oF t riPles ” D ata m oDel anD a lgebra
Relational equi-join operations link related triples and very frequently a large pro-
portion of the join operations are used to assemble attributes or immediate rela-
tionships of a resource. In graph theoretic terms, these operations reconstruct
star-subgraphs rooted at a subject node. The typical MapReduce execution plans
generated by relational-like platforms such as Hive and Pig consist of a sequence of
MapReduce cycles, one for each star-join, and subsequent MapReduce cycles for the
remaining joins in the query. For example, the graph pattern query in Figure 6.5 can
be evaluated in 3 MR cycles: MR 1 and MR 2 to compute star subpatterns SJ 1 and SJ 2,
respectively, followed by a third-cycle MR 3 to join the star-join results.
However, grouping of joins based on star structures does not necessarily result
in the typical join order generated using traditional cost-based optimization. One
challenge is that most cloud processing platforms are used in an on-demand model,
where precomputed statistics for cost-based optimization may not be available or
take too long to compute, resulting in long lead times. More importantly, ordering
joins in terms of their costs may generate some linear subplans requiring one input as
the full triple relation, which in the absence of an index is a full scan. Such plans may
incur larger overhead due to HDFS reads, which outweighs the savings achieved by
pushing selective joins ahead. However, if we consider a different representation for
related triples, it might be possible to compute the required star subgraphs with fewer
MapReduce cycles. For example, a GROUP BY operation on the subject column of
the triple relation will produce all groups of triples with the same subject, that is,
equivalent to star-subgraphs in a relation.
As an illustration, consider the star subqueries SJ 1 and SJ 2 in the query and the
result of the GROUP BY operation on input relation T as shown in Figure 6.5. The
groups of triples or triplegroups tg 1 , tg 2 , and tg 3 can be seen as potential matches to
the star subpatterns SJ 1 and SJ 2, respectively. They contain the same content or are
“content-equivalent” to n-tuples resulting from the corresponding relational-style
join operations. In other words, given a query with n star subpatterns, the grouping-
based approach results in an MR execution plan with n MR cycles, as opposed to
(2 n − 1) cycles using the relational-style approach.
The triplegroup model has another important property in that it represents multi-
valued relationships implicitly. Join operations involving multivalued relationships
result in redundant information due to the repetition of associated attributes with
each distinct value of the multivalued relationship. However, the triplegroup model
eliminates such redundancy by implicitly representing subgraphs that contain multi-
valued relationships. We will revisit this issue in a later section.
Search WWH ::




Custom Search