[Figure 6.18: COGROUP of pLabel by Sub, pProp by Sub, ..., prodFeature by Sub; yields the nested tuple (Prod1, {(Prod1, pLabel, Prod1)}, {(..., pProp, ...)}, ..., {(Prod1, prodFeature, PF1), (Prod1, prodFeature, PF2), (Prod1, prodFeature, PF3), (Prod1, prodFeature, PF4)})]
FIGURE 6.18 Nested tuple resulting from a COGROUP on vertically partitioned relations.
Early Complete Unnesting: Reduce-Side Full Replication. Apache Pig's nested data model can be exploited to eliminate data redundancy while representing star-join results corresponding to a Star-MVP. This can be achieved by processing star joins as a COGROUP operation, which groups multiple relations on the same column (the Subject column in this case). A COGROUP on N relations results in a nested tuple with N columns, where each column is a bag containing the corresponding tuples from one of the participating relations, as represented in Figure 6.18. The COGROUP-based star-join computation of Star-MVPs minimizes the redundancy factor for $MR_{SJ1}$ ($RedF_{SJ1} = 0$) and reduces the amount of disk writes ($R_{Write}$) in $MR_{SJ1}$. However, each “column” in the result of a COGROUP is a bag of tuples, and Pig's JOIN operator is not defined on nested columns. Hence, processing any subsequent join operation requires unnesting (or flattening, in Pig Latin parlance) of the join column.
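
The grouping described above can be sketched in a few lines of Python. This is only a simulation of what a COGROUP on the Subject column produces, not Pig code: the helper name `cogroup` is hypothetical, and the sample triples reuse the pLabel and prodFeature values from Figure 6.18 (the pProp relation is left out because its values are elided in the figure).

```python
from collections import defaultdict


def cogroup(*relations):
    """Group N relations of (subject, property, object) triples on subject.

    Returns {subject: (bag_1, ..., bag_N)} with one bag (a list) per input
    relation, loosely mirroring the nested tuple a COGROUP produces.
    """
    n = len(relations)
    grouped = defaultdict(lambda: tuple([] for _ in range(n)))
    for i, relation in enumerate(relations):
        for triple in relation:
            grouped[triple[0]][i].append(triple)
    return dict(grouped)


# Vertically partitioned relations: one relation per property.
p_label = [("Prod1", "pLabel", "Prod1")]
prod_feature = [("Prod1", "prodFeature", f"PF{i}") for i in range(1, 5)]

nested = cogroup(p_label, prod_feature)
label_bag, feature_bag = nested["Prod1"]

# Every input triple is stored exactly once in the nested tuple.
stored = len(label_bag) + len(feature_bag)
print(nested["Prod1"])
print(stored)
```

Because each bag holds its relation's triples unchanged, the nested form stores five triples for five input triples, which is the sense in which the representation carries no redundancy.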
Remark 6.7 In the case of a join operation on any of the single-valued columns, such as the object of pLabel, we may partially unnest the nested tuple based on the required column, which does not affect redundancy. However, a join on the multivalued column requires complete unnesting of the tuples, resulting in redundant information about the single-valued columns. Both partial and complete unnesting can be achieved using the FLATTEN operator at the end of the reduce phase that generates the input to the MVJoin operation ($MR_{SJ1}$, as shown in Figure 6.17a). Complete unnesting of the nested tuple results in full replication of the Star-MVP, and the replication factor Rep is a function of the multiplicity of the multivalued property.
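
To make the difference between partial and complete unnesting concrete, here is a minimal Python sketch. It assumes that flattening several bags at once yields their cross product, which is how it is modeled here; the helper `flatten` and the sample tuple are hypothetical stand-ins, not Pig's FLATTEN operator itself.

```python
from itertools import product


def flatten(nested_tuple, columns):
    """Unnest the bags at the given column indexes via their cross product.

    Bags at any other index are left nested.
    """
    rows = []
    for combo in product(*(nested_tuple[c] for c in columns)):
        row = list(nested_tuple)
        for c, value in zip(columns, combo):
            row[c] = value
        rows.append(tuple(row))
    return rows


nested = (
    "Prod1",
    [("Prod1", "pLabel", "Prod1")],                             # single-valued
    [("Prod1", "prodFeature", f"PF{i}") for i in range(1, 5)],  # multivalued
)

partial = flatten(nested, columns=[1])      # unnest only the pLabel bag
complete = flatten(nested, columns=[1, 2])  # unnest every bag

# Count how often the single-valued pLabel triple appears after unnesting.
label_copies = sum(1 for row in complete if row[1] == nested[1][0])

print(len(partial))   # 1
print(len(complete))  # 4
print(label_copies)   # 4
```

Partial unnesting leaves a single row with the prodFeature bag intact, whereas complete unnesting repeats the pLabel triple once per prodFeature value, so in this example the number of copies equals the multiplicity of the multivalued property.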
Lazy Complete Unnesting: Map-Side Full Replication. NTGA operators are nesting-aware and do not require unnesting before operations such as join. This allows the unnesting of Star-MVPs to be delayed to the map phase of the MVJoin operation ($MR_{J1}$ in Figure 6.17b) as opposed to the reduce phase of a previous cycle ($MR_{SJ1}$ in Figure 6.17a). To support multivalued properties, RDFMap is extended to support (Property, List〈Object〉) pairs such as (prodFeature, {PF1, PF2, PF3, PF4, PF5}), as shown in Figure 6.19. For the rest of this discussion, we use the notation Pr1_rMap-PF1, …, PFn to refer to a triplegroup corresponding to Product Pr1 and containing an MV property prodFeature with n object values PF1, …, PFn.
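
The idea of keeping a triplegroup compact and unnesting it only when a consumer needs flat tuples can be sketched as follows. This is an illustration under stated assumptions: a plain Python dict stands in for an RDFMap-like (Property, List〈Object〉) structure, the function `unnest_triplegroup` is hypothetical, and the pLabel value is made up for the example. It is not the NTGA implementation.

```python
def unnest_triplegroup(subject, rmap, mv_property):
    """Yield one flat triplegroup per object of the multivalued property.

    `rmap` maps each property to a list of objects. The input is not
    modified; each yielded map is a shallow copy with a single object
    for `mv_property`.
    """
    for obj in rmap[mv_property]:
        flat = dict(rmap)
        flat[mv_property] = [obj]
        yield subject, flat


# Compact form: the triplegroup is stored once, with all five features nested.
pr1 = ("Pr1", {
    "pLabel": ["Pr1"],  # illustrative value, not taken from the source
    "prodFeature": ["PF1", "PF2", "PF3", "PF4", "PF5"],
})

# Unnesting is deferred until the consuming step (a map-phase stand-in) runs.
flat_groups = list(unnest_triplegroup(pr1[0], pr1[1], "prodFeature"))

print(len(flat_groups))
print(flat_groups[0])
```

Since the generator produces flat copies only on demand, the nested triplegroup is what gets written and read between steps, and the replicated form exists only inside the consuming step.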