Algebraic Optimization of RDF Graph Pattern Queries on MapReduce - Large Scale and Big Data: Processing and Management


where possible to minimize the costs of parameter passing, context switching between methods, and the number of MR jobs required. For example, the traditional execution plan would execute the grouping step in the reduce phase of one MapReduce cycle and then the group-filtering step in the map phase of the subsequent MapReduce cycle, thus requiring at least two MapReduce cycles. In RAPID+, both operations are merged into a single MapReduce cycle by the introduction of a new operator called POTGPackage, which coalesces the TG_GroupFilter operator into the reduce side of Pig's relational GROUP BY operator (POPackage). Other logical operators are mapped to physical operators much as in Pig; for example, LOTGJoin is mapped to multiple physical operators (e.g., POTGJoinAnnotator and POTGJoinPackage). Finally, the physical plan is divided into multiple MR jobs, as shown in Figure 6.9c. Note that the relational-style plan shown in Section 6.3 needs three MR jobs, while the NTGA-based plan requires only two, saving one MR cycle.
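
The merging of grouping and group filtering into one reduce phase can be illustrated with a small in-memory simulation. This is only a sketch, not RAPID+ or Pig code: the function names are hypothetical, the shuffle is simulated with a dictionary, and the group filter is assumed to keep a triplegroup only when it contains every property required by a star subpattern.

```python
from collections import defaultdict


def map_phase(triples):
    """Emit (subject, (property, object)) pairs so triples group by subject."""
    for s, p, o in triples:
        yield s, (p, o)


def shuffle(pairs):
    """Stand-in for the MapReduce shuffle: collect values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group_and_filter(subject, values, required_props):
    """Build one triplegroup and filter it in the same reduce call.

    Assumption: a triplegroup passes the filter only if it holds every
    property in required_props.
    """
    present = {p for p, _ in values}
    if required_props <= present:
        return subject, sorted(values)
    return None  # group rejected; no second cycle is needed to drop it


def run_single_cycle(triples, required_props):
    """Run map, shuffle, and the merged reduce step as one simulated cycle."""
    results = []
    for subject, values in shuffle(map_phase(triples)).items():
        out = reduce_group_and_filter(subject, values, required_props)
        if out is not None:
            results.append(out)
    return sorted(results)


triples = [
    ("p1", "name", "A"),
    ("p1", "price", "10"),
    ("p2", "name", "B"),
]
matches = run_single_cycle(triples, {"name", "price"})
```

Here only the triplegroup for p1 survives, because p2 lacks a price property. In the traditional plan the filter would instead run in the map phase of a second cycle over the grouped output.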

Remark 6.4

Generally, the relational-style approach using Pig Latin operators requires (2n − 1) MR jobs to process a query with n star subpatterns. In the NTGA-based MR workflow, the number of MR jobs for producing stars is always one, because the grouping operation essentially produces all the stars in the first MR job. Therefore, the NTGA-based plan processes the same query using only n MR jobs.
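
The two job counts in the remark can be written down as a short calculation. The function names are hypothetical, and the breakdown of the NTGA count into one star-producing job plus n − 1 further jobs is an assumed reading of the remark rather than something it states explicitly.

```python
def relational_mr_jobs(n_stars):
    """MR jobs for the relational-style plan, per the remark: 2n - 1."""
    if n_stars < 1:
        raise ValueError("a query needs at least one star subpattern")
    return 2 * n_stars - 1


def ntga_mr_jobs(n_stars):
    """MR jobs for the NTGA-based plan, per the remark: n.

    Assumed breakdown: one job produces all stars, and the remaining
    n - 1 jobs combine them.
    """
    if n_stars < 1:
        raise ValueError("a query needs at least one star subpattern")
    return 1 + (n_stars - 1)


# The two-star case matches the three-versus-two comparison in the text.
two_star_counts = (relational_mr_jobs(2), ntga_mr_jobs(2))
```

Under these formulas the saving is n − 1 MR cycles, so it grows with the number of star subpatterns in the query.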

6.6.3 Implementation of NTGA Operators

6.6.3.1 Data Model Representation—RDFMap

The Pig Latin data model supports a collection data type called a bag that can be used to capture a group of tuples or, in the NTGA context, a triplegroup. A bag is implemented as an array list of tuples and provides an iterator to process them. Consequently, implementing NTGA operators such as TG_Filter, TG_GroupFilter, TG_Join, etc., using this data structure requires an iteration through the data bag, which is expensive. For example, given a graph pattern with a set of triple patterns TP and a data graph represented as a set of triplegroups TG, the TG_GroupFilter operator requires matching each triple pattern in TP with each tuple t for each triplegroup tg ∈ TG. In addition, representing triples as 3-tuples (s, p, o) results in redundant s (o) components for subject (object) triplegroups. RAPID+ uses an extended map structure called RDFMap that captures (i) the subject Sub associated with the triples in a triplegroup, (ii) a hashmap propMap that records the mappings from property types to object values, and (iii) a structure-label EC that encodes the property types in the triplegroup. This enables efficient look-up of triples matching a given triple pattern and a compact representation of intermediate results. Since the subjects of triples in a triplegroup are often repeated, RDFMap avoids this redundancy by using a single field Sub to represent the subject component. Using this representation model, a nested triplegroup can be supported
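
As an illustration of the RDFMap idea described above, the following sketch contrasts a bag scan with a hash look-up. It is not the RAPID+ implementation: the class layout, method names, and the choice to model the structure label EC as a frozenset of property types are all assumptions made for this example.

```python
from dataclasses import dataclass, field


def match_in_bag(bag, prop):
    """Baseline: scan every (s, p, o) tuple in a bag for one property."""
    return [t for t in bag if t[1] == prop]


@dataclass
class RDFMap:
    """Hypothetical model of a subject triplegroup."""

    sub: str  # the subject, stored once for the whole triplegroup
    prop_map: dict = field(default_factory=dict)  # property -> list of objects

    def add(self, prop, obj):
        """Record one (property, object) pair under the shared subject."""
        self.prop_map.setdefault(prop, []).append(obj)

    @property
    def ec(self):
        """Structure label; assumed here to be the set of property types."""
        return frozenset(self.prop_map)

    def match(self, prop):
        """Return triples for the pattern (sub, prop, ?o) via one hash look-up."""
        return [(self.sub, prop, o) for o in self.prop_map.get(prop, [])]


def rdfmap_from_bag(bag):
    """Convert a non-empty bag of same-subject triples into an RDFMap."""
    tg = RDFMap(sub=bag[0][0])
    for s, p, o in bag:
        if s != tg.sub:
            raise ValueError("all triples in a subject triplegroup share a subject")
        tg.add(p, o)
    return tg


bag = [("p1", "name", "A"), ("p1", "price", "10"), ("p1", "price", "12")]
tg = rdfmap_from_bag(bag)
```

Both `tg.match("price")` and `match_in_bag(bag, "price")` return the same two triples, but the former avoids iterating over unrelated tuples, and the subject is stored only once.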
