where possible to minimize the costs of parameter passing, context switching
between methods, and the number of MR jobs required. For example,
the traditional execution plan would execute the grouping step in the reduce
phase of one MapReduce cycle and the group-filtering step in the map phase of
the subsequent MapReduce cycle, thus requiring at least two MapReduce cycles.
In RAPID+, both operations are merged into a single MapReduce cycle by
introducing a new operator called POTGPackage, which coalesces the
TG_GroupFilter operator into the reduce side of Pig's relational GROUP BY
operator (POPackage). Other logical operators are mapped to physical
operators as in Pig; for example, LOTGJoin is mapped to multiple
physical operators (e.g., POTGJoinAnnotator and POTGJoinPackage).
Finally, the physical plan is divided into multiple MR jobs, as shown in Figure
6.9c. Note that the relational-style plan shown in Section 6.3 needs three MR
jobs, while the NTGA-based plan requires only two, saving one MR
cycle.
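The effect of fusing grouping and group-filtering can be illustrated with a small in-memory simulation. This is only a sketch, not Pig or RAPID+ operator code: the function names and sample triples are hypothetical, and it assumes that group-filtering keeps only those triplegroups that contain every property required by a star subpattern.

```python
from collections import defaultdict


def group_by_subject(triples):
    """Simulate reduce-side grouping: collect triples that share a subject."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append((s, p, o))
    return groups


def group_and_filter(triples, required_props):
    """Apply the group filter while each group is being packaged, so no
    second pass (i.e., no extra MapReduce cycle) over the groups is needed.
    Assumption: a group qualifies if it has all required properties."""
    result = {}
    for subject, group in group_by_subject(triples).items():
        props = {p for _, p, _ in group}
        if required_props <= props:
            result[subject] = group
    return result


# Hypothetical sample data: only g1 has both required properties.
triples = [
    ("g1", "label", "Gene A"),
    ("g1", "xRef", "ref1"),
    ("g2", "label", "Gene B"),
]
matches = group_and_filter(triples, {"label", "xRef"})
```

Here the filter runs at the point where each group is assembled, which mirrors the idea of placing the filter on the reduce side instead of deferring it to the map phase of a later job.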
Remark 6.4
Generally, the relational-style approach using Pig Latin operators requires
(2n − 1) MR jobs to process a query with n star subpatterns. In the NTGA-based
MR workflow, the number of MR jobs for producing stars is always one because
the grouping operation produces all the stars in the first MR job. Therefore,
the NTGA-based plan processes the same query using only n MR jobs.
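The job counts in this remark can be checked with a few lines of arithmetic. The sketch below uses hypothetical function names; the split of the NTGA count into one star-producing job plus n − 1 further jobs is inferred from the remark (a total of n jobs, one of which produces all stars), not stated separately in the text.

```python
def relational_mr_jobs(n_stars):
    """MR jobs for the relational-style plan: 2n - 1 (from Remark 6.4)."""
    return 2 * n_stars - 1


def ntga_mr_jobs(n_stars):
    """MR jobs for the NTGA-based plan: n (from Remark 6.4)."""
    star_jobs = 1                  # one grouping job produces all stars
    remaining_jobs = n_stars - 1   # inferred: the rest of the n jobs
    return star_jobs + remaining_jobs


def jobs_saved(n_stars):
    """Difference between the two plans: (2n - 1) - n = n - 1."""
    return relational_mr_jobs(n_stars) - ntga_mr_jobs(n_stars)


for n in range(1, 5):
    print(n, relational_mr_jobs(n), ntga_mr_jobs(n), jobs_saved(n))
```

For n = 2 this gives three jobs versus two, which agrees with the two-star example discussed before the remark; the saving grows linearly with the number of star subpatterns.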
6.6.3 Implementation of NTGA Operators
6.6.3.1 Data Model Representation—RDFMap
The Pig Latin data model supports a collection data type called a bag that can
be used to capture a group of tuples or, in the NTGA context, a triplegroup. A
bag is implemented as an array list of tuples and provides an iterator to process
them. Consequently, implementing NTGA operators such as TG_Filter,
TG_GroupFilter, and TG_Join using this data structure requires iterating
through the data bag, which is expensive. For example, given a graph pattern
with a set of triple patterns TP and a data graph represented as a set of triplegroups
TG, the TG_GroupFilter operator requires matching each triple pattern in TP
with each tuple t of each triplegroup tg ∈ TG. In addition, representing triples as
3-tuples (s, p, o) results in redundant s (o) components for subject (object)
triplegroups. RAPID+ uses an extended map structure called RDFMap that captures (i)
the subject Sub associated with the triples in a triplegroup, (ii) a hashmap propMap
that records the mappings from property types to object values, and (iii) a
structure-label EC that encodes the property types in the triplegroup. This enables efficient
look-up of triples matching a given triple pattern and a compact representation of
intermediate results. Since the subject of the triples in a triplegroup is often repeated,
RDFMap avoids this redundancy by using a single field Sub to represent the subject
component. Using this representation model, a nested triplegroup can be supported