6.9 NESTING-AWARE PHYSICAL OPERATORS TO MINIMIZE
DATA TRANSFER COSTS IN NTGA EXECUTION PLANS
In this section, we consider the issue of efficient management of intermediate results
while evaluating graph pattern queries with multivalued relationships. Many real-world
data sets contain multivalued attributes or relationships, for example, friendships
in a social network or citation references. In join-intensive processing, this
means that many of the tuple combinations generated by a join operation
contain redundant data. Specifically, the subtuple containing the non-multivalued
attributes is repeated for each distinct value of the multivalued attribute.
Example 6.7: Graph Pattern with Multivalued Property
Consider the join SJ1 in Figure 6.16, which is a star join among relations T_pLabel,
T_pProp, and T_prodFeature on the Sub column, to reassemble the label, property, and
feature of products. Note that prodFeature is a multivalued property that defines the
one-to-many relationship between a product and its features. Consider the output
tuples of the star join in Out_MR1 that represent details about a product Prod1 with
multiple product features (PF1, PF2, etc.). The subtuple labeled (Sub1, Prop1, Obj1,
Prop2, Obj2, Prop3) is repeated for each distinct value of the product feature.
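The repetition described in this example can be reproduced with a small sketch. The relation contents, the dictionary layout, and the helper name star_join_flat are illustrative assumptions, not the NTGA implementation; the sketch only shows how a flat star join on the subject column repeats the single-valued part of the tuple once per feature value.

```python
from itertools import product

# Toy relations keyed by subject (the Sub column). All values are hypothetical.
t_label = {"Prod1": ["Label1"]}
t_prop = {"Prod1": ["PropVal1"]}
t_feature = {"Prod1": ["PF1", "PF2", "PF3"]}  # multivalued property


def star_join_flat(*relations):
    """Join relations on the shared subject, emitting one flat tuple per combination."""
    common = set(relations[0])
    for rel in relations[1:]:
        common &= set(rel)
    rows = []
    for sub in sorted(common):
        # Cartesian product of the value lists for this subject.
        for combo in product(*(rel[sub] for rel in relations)):
            rows.append((sub, *combo))
    return rows


flat = star_join_flat(t_label, t_prop, t_feature)
for row in flat:
    print(row)

# The prefix (subject, label, property) is identical in every output row;
# only the last field, the product feature, differs.
repeated_prefixes = {row[:3] for row in flat}
print(len(flat), "rows share", len(repeated_prefixes), "distinct prefix")
```

With a flat tuple model, the number of output rows for a product equals the number of its feature values, so the shared prefix is written once per feature.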
Remark 6.5
We define the redundancy factor of an output as the portion of redundant data in the
output that is written onto the HDFS at the end of a MapReduce cycle. Typically,
the redundancy factor is proportional to the multiplicity of the multivalued attribute.
Multivalued properties with high multiplicity, such as the Facebook friends of highly
social persons, result in a high redundancy factor in intermediate results when they
are part of graph pattern queries.
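One possible way to quantify the redundancy factor is sketched below. The text does not give a formula, so the measure used here is an assumption: the fraction of field values in a flat output that are repeated copies of the non-multivalued subtuple. The function name and parameters are hypothetical.

```python
def redundancy_factor(num_single_valued: int, multiplicity: int) -> float:
    """Fraction of written field values that are repeated copies (assumed measure).

    num_single_valued: number of fields in the repeated subtuple.
    multiplicity: number of distinct values of the multivalued attribute.
    """
    if num_single_valued < 0 or multiplicity < 1:
        raise ValueError("need num_single_valued >= 0 and multiplicity >= 1")
    # A flat output writes one row per multivalued value.
    total_cells = multiplicity * (num_single_valued + 1)
    # Every row after the first repeats the whole single-valued subtuple.
    redundant_cells = (multiplicity - 1) * num_single_valued
    return redundant_cells / total_cells


# Six-field subtuple, as in (Sub1, Prop1, Obj1, Prop2, Obj2, Prop3).
for m in (1, 2, 10, 100):
    print(m, round(redundancy_factor(6, m), 3))
```

Under this assumed measure the factor is zero when the attribute has a single value and grows as the multiplicity increases, which matches the qualitative behavior described in the remark.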
Impact of Redundancy on Processing Costs. The redundancy factor is likely to
compound across subsequent join operations; that is, the portion of redundant data in
Out_MR1 increases further after join J1' (refer to Out_MR3 in Figure 6.16). This ripple effect
of redundancy in intermediate results has a negative impact on the HDFS writes of the
current cycle, as well as the HDFS reads and data-shuffling costs of subsequent cycles.
The impact on HDFS writes is significant when using flat data models. Additionally,
the bloated intermediate results increase the total disk space requirements in systems
such as Hadoop, which store intermediate results until the completion of the entire execution
workflow. Hence, efficient management of redundancy in join-intensive
data-processing workloads is important to keep MR workflows nimble and cost-effective.
Remark 6.6
NTGA's nested data model already enables concise representation of intermediate
results. For example, m star subgraphs containing redundant information due to the