Algebraic Optimization of RDF Graph Pattern Queries on MapReduce - Large Scale and Big Data: Processing and Management

6.9 NESTING-AWARE PHYSICAL OPERATORS TO MINIMIZE DATA TRANSFER COSTS IN NTGA EXECUTION PLANS

In this section, we consider the issue of efficiently managing intermediate results
while evaluating graph pattern queries with multivalued relationships. Many real-world
data sets contain multivalued attributes or relationships, for example, friendships in
a social network or citation references. In join-intensive processing, this means that
many of the tuple combinations generated by a join operation contain redundant data.
Specifically, the subtuple containing the non-multivalued attributes is repeated for
each distinct value of the multivalued attribute.

Example 6.7: Graph Pattern with Multivalued Property

Consider the join SJ1 in Figure 6.16, which is a star join among relations T_pLabel,
T_pProp, and T_prodFeature on the Sub column, to reassemble the label, property, and
feature of products. Note that prodFeature is a multivalued property that defines the
one-to-many relationship between a product and its features. Consider the output
tuples of the star join in Out_MR1 that represent details about a product Prod1 with
multiple product features (PF1, PF2, etc.). The subtuple labeled (Sub1, Prop1, Obj1,
Prop2, Obj2, Prop3) is repeated for each distinct value of the product feature.
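The repetition described in Example 6.7 can be reproduced with a small in-memory sketch. The triples, property values, and the star_join helper below are hypothetical illustrations, not the data of Figure 6.16 or NTGA's operators; the sketch only assumes a flat data model in which each output tuple carries one value per property.

```python
from itertools import product

# Hypothetical (Sub, Prop, Obj) triples, one relation per property.
t_plabel = [("Prod1", "label", "Label1")]
t_pprop = [("Prod1", "pProp", "Val1")]
# prodFeature is multivalued: one product, several features.
t_prodfeature = [("Prod1", "prodFeature", f"PF{i}") for i in range(1, 4)]

def star_join(*relations):
    """Join relations on the Sub column and emit flat tuples."""
    subjects = set.intersection(*({row[0] for row in rel} for rel in relations))
    out = []
    for sub in sorted(subjects):
        # Collect the (Prop, Obj) pairs of this subject in each relation.
        groups = [[(p, o) for s, p, o in rel if s == sub] for rel in relations]
        # A flat model emits one tuple per combination of pairs.
        for combo in product(*groups):
            out.append((sub,) + tuple(x for pair in combo for x in pair))
    return out

flat_out = star_join(t_plabel, t_pprop, t_prodfeature)
for row in flat_out:
    print(row)
```

Each printed tuple starts with the same six fields (the subject, two property-object pairs, and the prodFeature property name) and differs only in the final feature value, which is the redundancy the example points out.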

Remark 6.5

We define the redundancy factor of an output as the portion of redundant data in the
output that is written to HDFS at the end of a MapReduce cycle. Typically, the
redundancy factor is proportional to the multiplicity of the multivalued attribute.
Multivalued properties with high multiplicity, such as the Facebook friends of highly
social persons, result in a high redundancy factor in intermediate results when they
are part of graph pattern queries.
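One possible way to put a number on the redundancy factor is to count fields. The measure below is an assumption made for illustration, not the chapter's formula: it treats every copy of the shared subtuple beyond the first as redundant, and the function name and parameters are hypothetical.

```python
def redundancy_factor(shared_fields: int, multiplicity: int) -> float:
    """Fraction of written fields that are repeated copies of the shared subtuple.

    shared_fields: number of fields in the non-multivalued subtuple (assumed).
    multiplicity:  number of distinct values of the multivalued attribute.
    """
    # Flat output: every tuple repeats the shared subtuple plus one value.
    total_fields = multiplicity * (shared_fields + 1)
    # Minimum needed: the shared subtuple once, plus each distinct value once.
    needed_fields = shared_fields + multiplicity
    return 1 - needed_fields / total_fields

for m in (1, 3, 100):
    print(m, round(redundancy_factor(6, m), 3))
```

Under this measure the factor is zero for a single-valued property and grows as the multiplicity increases, which matches the qualitative claim in the remark.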

Impact of Redundancy on Processing Costs. The redundancy factor is likely to
compound across subsequent join operations; that is, the portion of redundant data in
Out_MR1 increases further after join J1' (refer to Out_MR3 in Figure 6.16). This ripple
effect of redundancy in intermediate results increases the HDFS writes of the current
cycle, as well as the HDFS reads and data-shuffling costs of subsequent cycles. The
impact on HDFS writes is significant when using flat data models. Additionally, the
bloated intermediate results increase the total disk space requirements in systems
such as Hadoop that store intermediate results until the completion of the entire
execution workflow. Hence, efficient management of redundancy while processing
join-intensive workloads is important to keep MR workflows nimble and cost-effective.
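The compounding effect can be sketched with a simple field count. The model below is a simplifying assumption, not a description of the joins in Figure 6.16: it supposes that each successive join contributes one more multivalued property, that a flat model writes one tuple per combination of values, and that a nested model writes the shared subtuple once with each value list stored once. The function names are hypothetical.

```python
from math import prod

def flat_fields(shared_fields: int, multiplicities: list[int]) -> int:
    """Fields written by a flat model: one tuple per combination of values."""
    rows = prod(multiplicities)
    return rows * (shared_fields + len(multiplicities))

def nested_fields(shared_fields: int, multiplicities: list[int]) -> int:
    """Fields written by a nested model: shared subtuple once, each value once."""
    return shared_fields + sum(multiplicities)

# After one join with a 3-valued property, then a second with a 4-valued one.
for mults in ([3], [3, 4]):
    print(mults, flat_fields(6, mults), nested_fields(6, mults))
```

In this toy model the flat output grows with the product of the multiplicities while the nested output grows with their sum, which is why redundancy left in one cycle's output inflates the reads, shuffles, and writes of the cycles that follow.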

Remark 6.6

NTGA's nested data model already enables concise representation of intermediate

results. For example, m star subgraphs containing redundant information due to the
