FIGURE 6.18
Nested tuple resulting from a COGROUP on vertically partitioned relations (pLabel by Sub, pProp by Sub, ..., prodFeature by Sub):
(Prod1, {(Prod1, pLabel, Prod1)}, {(..., pProp, ...)}, ..., {(Prod1, prodFeature, PF1), (Prod1, prodFeature, PF2), (Prod1, prodFeature, PF3), (Prod1, prodFeature, PF4)})
Early Complete Unnesting: Reduce-Side Full Replication. Apache Pig's
nested data model can be exploited to eliminate data redundancy when representing
star-join results corresponding to Star-MVPs. This is achieved by processing a
star join as a COGROUP operation, which groups multiple relations on the same
column (here, the Subject column). A COGROUP on N relations results in a nested
tuple that holds the grouping key followed by N columns, where each column is a
bag containing the corresponding tuples from one participating relation, as
represented in Figure 6.18. The COGROUP-based star-join computation of Star-MVPs
minimizes the redundancy factor for MR_SJ1 (RedF_SJ1 = 0) and reduces the amount
of disk writes (R_Write) in MR_SJ1. However, each “column” in the result of a
COGROUP is a bag of tuples, and Pig's JOIN operator is not defined on nested
columns. Hence, processing any subsequent join operation requires unnesting
(flattening, in Pig Latin parlance) of the join column.
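The grouping behavior described above can be illustrated with a minimal Python sketch. It is not Pig's implementation; the function name cogroup, the relation contents, and the label value Label1 are assumptions made only for illustration.

```python
from collections import defaultdict


def cogroup(relations, key_index=0):
    """Group several relations on the same column.

    `relations` maps a relation name to a list of tuples. The result maps
    each key to a nested tuple (key, bag_1, ..., bag_N), with one bag per
    relation, in the order the relations were given.
    """
    names = list(relations)
    groups = defaultdict(lambda: [[] for _ in names])
    for position, name in enumerate(names):
        for row in relations[name]:
            groups[row[key_index]][position].append(row)
    return {key: (key, *bags) for key, bags in groups.items()}


# Hypothetical vertically partitioned relations, one per property,
# each holding (Subject, Property, Object) triples.
relations = {
    "pLabel": [("Prod1", "pLabel", "Label1")],
    "prodFeature": [("Prod1", "prodFeature", f"PF{i}") for i in range(1, 5)],
}

nested = cogroup(relations)
key, label_bag, feature_bag = nested["Prod1"]
```

In this sketch every input triple is stored exactly once inside the nested tuple, which is the sense in which the nested representation avoids redundancy. The bags are ordinary lists, so a flat join on feature_bag would first need its elements pulled out into separate tuples.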
Remark 6.7
For a join on any of the single-valued columns, such as the object of pLabel,
we may partially unnest the nested tuple on the required column, which does not
affect redundancy. However, a join on the multivalued column requires complete
unnesting of the tuples, which replicates the information about the
single-valued columns. Both partial and complete unnesting can be achieved
using the FLATTEN operator at the end of the reduce phase that generates the
input to the MVJoin operation (MR_SJ1, as shown in Figure 6.17a). Complete
unnesting of the nested tuple results in full replication of the Star-MVP, and
the replication factor Rep is a function of the multiplicity of the multivalued
property.
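The difference between partial and complete unnesting can be sketched in Python as follows. This only mimics the effect of flattening bags. It is not Pig's FLATTEN operator, and the helper names partial_unnest and complete_unnest, as well as the sample values, are hypothetical.

```python
from itertools import product


def partial_unnest(nested, column):
    """Unnest a single bag column: one output tuple per element of that bag.

    All other columns, including other bags, are copied unchanged.
    """
    return [
        nested[:column] + (item,) + nested[column + 1:]
        for item in nested[column]
    ]


def complete_unnest(nested):
    """Unnest every bag: one flat tuple per combination of bag elements."""
    key, *bags = nested
    return [(key, *combo) for combo in product(*bags)]


# Hypothetical nested tuple: key, single-valued bag, multivalued bag.
nested = (
    "Prod1",
    [("Prod1", "pLabel", "Label1")],
    [("Prod1", "prodFeature", f"PF{i}") for i in range(1, 5)],
)

# Unnesting the single-valued column yields one tuple, so nothing is copied.
partial = partial_unnest(nested, 1)

# Unnesting everything yields one tuple per prodFeature value, and the
# pLabel triple is repeated in each of them.
complete = complete_unnest(nested)
label_copies = sum(1 for row in complete if row[1][1] == "pLabel")
```

With four prodFeature values, the complete unnesting produces four flat tuples and four copies of the pLabel triple, so in this example the replication grows with the multiplicity of the multivalued property.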
Lazy Complete Unnesting: Map-Side Full Replication. NTGA operators are
nesting-aware and do not require unnesting before operations such as join. This
allows the unnesting of Star-MVPs to be delayed to the map phase of the MVJoin
operation (MR_J1 in Figure 6.17b), as opposed to the reduce phase of a previous
cycle (MR_SJ1 in Figure 6.17a). To support multivalued properties, RDFMap is
extended to hold (Property, List〈Object〉) pairs such as (prodFeature, {PF1, PF2,
PF3, PF4, PF5}), as shown in Figure 6.19. For the rest of this discussion, we use
the notation Pr1_rMap-PF1, …, PFn to refer to a triplegroup that corresponds to
Product Pr1 and contains an MV property prodFeature with n object values
PF1, …, PFn.
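As a rough illustration of delaying the unnesting to the map phase, the Python sketch below models a triplegroup as a subject plus a dictionary from property to list of objects. This is an assumed stand-in for RDFMap, not its actual interface. The function map_side_unnest, the shape of the emitted (join key, triplegroup) pairs, and the label value are all hypothetical.

```python
def map_side_unnest(subject, rmap, mv_property):
    """Emit one (join_key, triplegroup) pair per object of the MV property.

    The stored triplegroup keeps a single nested copy of its data; the
    replication only appears in the map output produced here.
    """
    for obj in rmap[mv_property]:
        yield obj, (subject, rmap)


# Assumed stand-in for Pr1_rMap-PF1, ..., PF5: each property maps to a
# list of object values, so a multivalued property is stored only once.
pr1_rmap = {
    "pLabel": ["Label1"],
    "prodFeature": ["PF1", "PF2", "PF3", "PF4", "PF5"],
}

# Number of object values held in the nested (stored) representation.
stored_values = sum(len(objects) for objects in pr1_rmap.values())

# Map-phase output for a join on the prodFeature objects.
emitted = list(map_side_unnest("Pr1", pr1_rmap, "prodFeature"))
join_keys = [join_key for join_key, _ in emitted]
```

Under these assumptions the triplegroup that is written out by the previous cycle holds six object values in total, while the five-fold replication is produced only when the map phase of the join emits one record per prodFeature value.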