Example 6.8: Lazy Unnesting
The lazy unnesting approach delays the unnesting of Pr1_rMap until the map phase
of the MVJoin, thus minimizing the redundancy factor in the output of MR_SJ1. This
translates to savings in disk writes (R_Write) in the star-join phase as well as
fewer reads (M_Read) in the subsequent MVJoin phase. NTGA's TG_MVJoin
operator implements the map-side unnest operation, which completely unnests
a Star-MVP and generates a map output tuple for each flattened copy of the
Star-MVP. For example, Pr1_rMap is unnested into five triplegroups, one for each of
the distinct product features. Hence, the replication factor Rep is a function of the
multiplicity of the multivalued property.
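The map-side unnest can be illustrated with a minimal Python sketch. It is not NTGA's TG_MVJoin implementation: the dictionary layout of the Star-MVP, the property names (subject, label, productFeature), the feature keys PF1 to PF5, and the function name unnest are all assumptions made for illustration.

```python
def unnest(star_mvp, mv_property):
    """Yield one (join_key, flattened_copy) pair per value of the multivalued property."""
    for obj in star_mvp[mv_property]:
        flat = dict(star_mvp)        # shallow copy of the whole Star-MVP
        flat[mv_property] = [obj]    # keep a single value of the multivalued property
        yield obj, flat


# Hypothetical Star-MVP for product Pr1 with five distinct product features.
pr1_rmap = {
    "subject": "Pr1",
    "label": ["Product 1"],
    "productFeature": ["PF1", "PF2", "PF3", "PF4", "PF5"],
}

map_output = list(unnest(pr1_rmap, "productFeature"))
rep = len(map_output)  # replication factor equals the multiplicity of the property
```

In this sketch every map output tuple carries a full copy of the remaining properties, which is why Rep grows with the multiplicity of the multivalued property.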
Lazy Partial Unnesting: Map-Side Partial Replication. If the multiplicity
of a multivalued property is greater than the number of partitions in the
reducer space, it is likely that multiple copies of the Star-MVP are assigned to
the same partition. Consider Figure 6.19 with 2 reducers (r = 2), where 3 copies
of the map output value Pr1_rMap, corresponding to the join keys PF1, PF3, and
PF5, respectively, are mapped to the same Reducer_bkt1. The map-side sorting
costs (MR_Sort), local writes (M_Write), and network communication costs (MR_TR) can
be reduced if the references to Pr1_rMap can be shared across the reduce function
space (rf_bkt, or a group of tuples processed by the same reduce function),
that is, if the replication factor Rep can be reduced. This can be achieved using
an extended partitioning scheme that allows sharing data references in the map
output to avoid full replication.
Example 6.9: Lazy Partial Unnesting
For our map input Pr1_rMap shown in Figure 6.19, an example partition scheme
func* could map the keys {PF1, PF3, PF5} to the same group key k1*, as shown in
Figure 6.19b. Consequently, only 1 copy of Pr1_rMap is transferred to
Reducer_bkt1, reducing the shuffle costs. The partial unnest operation partially unnests
triplegroups based on the grouping function func* and is integrated into the map
phase of an optimized join operator, TG_OptMVJoin.
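The effect of the grouping function on the replication factor can be sketched in Python as follows. This is an illustration under stated assumptions, not the book's algorithm: func_star is a hypothetical func* that groups feature keys by their numeric suffix modulo r, and the dictionary layout, the feature keys PF1 to PF5, and the name partial_unnest are likewise assumed.

```python
from collections import defaultdict


def func_star(join_key, r):
    # Hypothetical func*: numeric suffix of the key (e.g. "PF3" -> 3) modulo r.
    return int(join_key[2:]) % r


def partial_unnest(star_mvp, mv_property, r):
    """Split the object list by group key; return one (k*, partial copy) per group."""
    groups = defaultdict(list)
    for obj in star_mvp[mv_property]:
        groups[func_star(obj, r)].append(obj)
    # Each partial copy keeps all objects that share the same group key.
    return [(k, {**star_mvp, mv_property: objs}) for k, objs in sorted(groups.items())]


# Hypothetical Star-MVP for product Pr1 with five distinct product features.
pr1_rmap = {
    "subject": "Pr1",
    "productFeature": ["PF1", "PF2", "PF3", "PF4", "PF5"],
}

p_list = partial_unnest(pr1_rmap, "productFeature", r=2)
full_rep = len(pr1_rmap["productFeature"])  # copies under full unnesting
partial_rep = len(p_list)                   # copies under partial unnesting
```

With r = 2 in this sketch, PF1, PF3, and PF5 share one group key, so a single partial copy carrying all three features is emitted for that group instead of three flattened copies.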
Implementation of TG_OptMVJoin. The optimized NTGA operator
TG_OptMVJoin implements the lazy partial unnesting strategy for joins involving
a multivalued property. Algorithm 6.5 shows the extensions to POTGJoinAnnotator
to enable partial unnesting. In the map phase, RDFMaps that join on subject Sub are
annotated using their group key k*, computed by k* = func*(Sub) (lines 1-2). For joins
on object, the RDFMap is partially unnested using the partial-unnest operation
(lines 7-11). The partial-unnest operator splits the object list of the multivalued
property based on the object's group key k* = func*(Obj), resulting in a list of
partially unnested RDFMaps (pList in line 3). A map output tuple is generated for
each partially unnested RDFMap, annotated by its group key (lines 4-6). The
replication factor Rep is now a function of func*.
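The map-phase logic described above can be approximated by the following Python sketch. It does not reproduce Algorithm 6.5 or POTGJoinAnnotator: the function opt_mvjoin_map, its parameters, the dictionary representation of an RDFMap, and the CRC32-based func_star are all hypothetical stand-ins chosen for illustration.

```python
import zlib


def func_star(key, r):
    # Hypothetical func*: deterministic hash of the key modulo r groups.
    return zlib.crc32(key.encode("utf-8")) % r


def opt_mvjoin_map(rdf_map, join_on, mv_property, r):
    """Return the map output tuples (k*, RDFMap) for one input RDFMap."""
    if join_on == "subject":
        # Join on subject: no unnesting, annotate with k* = func*(Sub).
        return [(func_star(rdf_map["subject"], r), rdf_map)]
    # Join on object: split the object list by k* = func*(Obj).
    buckets = {}
    for obj in rdf_map[mv_property]:
        buckets.setdefault(func_star(obj, r), []).append(obj)
    # One output tuple per partially unnested RDFMap (the pList of the text).
    return [(k, {**rdf_map, mv_property: objs}) for k, objs in buckets.items()]
```

Under these assumptions, a join on object emits at most min(r, multiplicity) tuples per RDFMap, so the replication factor depends on how func_star distributes the objects rather than on the multiplicity alone.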