where possible to minimize the costs of parameter passing, context switching
between methods, and the number of MR jobs required. For example, the
traditional execution plan would execute the grouping step in the reduce phase
of one MapReduce cycle and then the group-filtering step in the map phase of
the subsequent cycle, thus requiring at least two MapReduce cycles. In RAPID+,
both operations are merged into a single MapReduce cycle by introducing a new
operator called POTGPackage, which coalesces the TG_GroupFilter operator into
the reduce side of Pig's relational GROUP BY operator (POPackage). Other
logical operators are mapped to physical ones much as in Pig; for example,
LOTGJoin is mapped to multiple physical operators (e.g., POTGJoinAnnotator and
POTGJoinPackage).
Finally, the physical plan is divided into multiple MR jobs, as shown in Figure
6.9c. Note that the relational-style plan shown in Section 6.3 needs three MR
jobs while the NTGA-based plan only requires two MR jobs, saving one MR
cycle.
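The merging of grouping and group filtering into one reduce step, described above, can be illustrated with a small in-memory simulation. This is only a sketch and not RAPID+ or Pig code: the function names are hypothetical, the shuffle is a plain dictionary, and the group filter is assumed to keep a triplegroup only when it contains every property required by a star subpattern.

```python
from collections import defaultdict


def map_phase(triples):
    """Emit (subject, (property, object)) pairs so triples group by subject."""
    for s, p, o in triples:
        yield s, (p, o)


def shuffle(pairs):
    """Stand-in for the MapReduce shuffle: collect values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group_and_filter(subject, prop_objs, required_props):
    """Build one triplegroup and apply the group filter in the same reduce call.

    Assumption: the filter passes a group only if all required properties occur.
    """
    present = {p for p, _ in prop_objs}
    if not required_props <= present:
        return None  # group filtered out; no second MR cycle is needed
    return subject, list(prop_objs)


def run_single_cycle(triples, required_props):
    """One simulated MapReduce cycle: map, shuffle, then fused reduce."""
    result = []
    for subject, prop_objs in shuffle(map_phase(triples)).items():
        tg = reduce_group_and_filter(subject, prop_objs, required_props)
        if tg is not None:
            result.append(tg)
    return result
```

In the traditional plan sketched in the text, the filtering inside reduce_group_and_filter would instead run in the map phase of a second cycle over the grouped output.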
Remark 6.4
Generally, the relational-style approach using Pig Latin operators requires
(2n − 1) MR jobs to process a query with n star subpatterns. In the NTGA-based
MR workflow, the number of MR jobs for producing stars is always one, because
the grouping operation essentially produces all the stars in the first MR job.
Therefore, the NTGA-based plan processes the same query using only n MR jobs.
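The two job counts in Remark 6.4 can be compared directly. The helper names below are hypothetical and the functions simply restate the formulas from the remark; they do not model anything about how the jobs are scheduled.

```python
def relational_mr_jobs(n):
    """MR jobs for the relational-style plan with n star subpatterns: 2n - 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 2 * n - 1


def ntga_mr_jobs(n):
    """MR jobs for the NTGA-based plan with n star subpatterns: n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n


def jobs_saved(n):
    """Difference between the two plans, which works out to n - 1."""
    return relational_mr_jobs(n) - ntga_mr_jobs(n)
```

For n = 2 the formulas give three and two jobs respectively, a saving of one MR cycle; the saving grows linearly as n − 1.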
6.6.3 Implementation of NTGA Operators
6.6.3.1 Data Model Representation—RDFMap
The Pig Latin data model supports a collection data type called a bag that can
be used to capture a group of tuples or, in the NTGA context, a triplegroup. A
bag is implemented as an array list of tuples and provides an iterator to
process them. Consequently, implementing NTGA operators such as TG_Filter,
TG_GroupFilter, TG_Join, etc., using this data structure requires an iteration
through the data bag, which is expensive. For example, given a graph pattern
with a set of triple patterns TP and a data graph represented as a set of
triplegroups TG, the TG_GroupFilter operator requires matching each triple
pattern in TP with each tuple t for each triplegroup tg ∈ TG. In addition,
representing triples as 3-tuples (s, p, o) results in redundant s (o)
components for subject (object) triplegroups. RAPID+ uses an extended map
structure called RDFMap that captures (i) the subject Sub associated with the
triples in a triplegroup, (ii) a hashmap propMap that records the mappings from
property types to object values, and (iii) a structure-label EC that encodes
the property types in the triplegroup. This enables efficient look-up of
triples matching a given triple pattern and a compact representation of
intermediate results. Since the subject of the triples in a triplegroup is
often repeated, RDFMap avoids this redundancy by using a single field Sub to
represent the subject component. Using this representation model, a nested
triplegroup can be supported
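The three parts of RDFMap listed above (Sub, propMap, EC) can be sketched as a small data structure. This is a minimal illustration and not the RAPID+ implementation: the class layout and the method names from_triples, match, and has_structure are hypothetical, propMap is assumed to hold a list of objects per property, and EC is modeled simply as the set of property types present.

```python
from dataclasses import dataclass, field


@dataclass
class RDFMap:
    sub: str                                       # (i) shared subject, stored once
    prop_map: dict = field(default_factory=dict)   # (ii) property -> list of objects
    ec: frozenset = frozenset()                    # (iii) property types present (assumed encoding)

    @classmethod
    def from_triples(cls, triples):
        """Build a subject triplegroup from (s, p, o) triples sharing one subject."""
        triples = list(triples)
        subjects = {s for s, _, _ in triples}
        if len(subjects) != 1:
            raise ValueError("a subject triplegroup must share exactly one subject")
        prop_map = {}
        for _, p, o in triples:
            prop_map.setdefault(p, []).append(o)
        return cls(sub=subjects.pop(), prop_map=prop_map, ec=frozenset(prop_map))

    def match(self, prop):
        """Objects matching the pattern (?s, prop, ?o): a hash lookup, not a bag scan."""
        return self.prop_map.get(prop, [])

    def has_structure(self, required_props):
        """Check a star subpattern's properties against the structure label."""
        return frozenset(required_props) <= self.ec
```

A brief usage example under the same assumptions:

```python
tg = RDFMap.from_triples([
    ("g1", "label", "Gene"),
    ("g1", "xRef", "r1"),
    ("g1", "xRef", "r2"),
])
objects = tg.match("xRef")                      # both objects for the xRef property
is_star = tg.has_structure({"label", "xRef"})   # True: both properties are present
```

Compared with a bag of 3-tuples, the subject is stored once and matching a triple pattern against the group does not require iterating over every tuple.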