Database Reference
In-Depth Information
SPLIT R INTO R1 IF B is not null,
R2 IF B is null;
j1 = JOIN R1 BY B, S BY B;
j1 = FOREACH j1 GENERATE A, R1::B AS B, C;
j2 = CROSS R2, S;
j2 = FOREACH j2 GENERATE A, S::B AS B, C;
J = UNION j1, j2;
The complexity increases with the number of join variables that can be unbound.
For example, for two possibly unbound join variables we already have to split the
bag into four distinct partitions (one for every possible combination). Our translator
recognizes if a join contains possibly unbound variables and performs the necessary
changes to the translation automatically. Fortunately, this situation does not occur
in most SPARQL queries. In fact, if a SPARQL query is well designed according to
[12], there are no joins over unbound variables at all.
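The four-way split for two possibly unbound join variables can be sketched outside Pig Latin to make the semantics concrete. The following Python sketch is an illustration only, not the translator's implementation. It assumes that only the left bag R may contain unbound values (represented as None), that S binds every join variable, and that rows are dictionaries. The function name null_aware_join and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import product


def null_aware_join(r_rows, s_rows, join_vars):
    """Join R with S where join variables in R may be unbound (None).

    Assumption: S binds every join variable. R is split into one partition
    per combination of bound/unbound join variables (up to 2**k partitions
    for k variables); each partition is joined only on the variables it binds.
    """
    # SPLIT: group the rows of R by the tuple of join variables they bind.
    partitions = defaultdict(list)
    for row in r_rows:
        bound = tuple(v for v in join_vars if row[v] is not None)
        partitions[bound].append(row)

    result = []
    for bound, rows in partitions.items():
        # An empty `bound` tuple degenerates to a cross product (CROSS).
        for r, s in product(rows, s_rows):
            if all(r[v] == s[v] for v in bound):
                # Values from S win, so unbound join variables get filled in.
                result.append({**r, **s})
    # UNION: the results of all partitions end up in one list.
    return result


R = [
    {"A": "a1", "B": "b1", "D": "d1"},   # both join variables bound
    {"A": "a2", "B": None, "D": "d1"},   # B unbound
    {"A": "a3", "B": None, "D": None},   # B and D unbound
]
S = [
    {"B": "b1", "D": "d1", "C": "c1"},
    {"B": "b2", "D": "d2", "C": "c2"},
]
J = null_aware_join(R, S, ["B", "D"])
```

With this sample data, the row for a1 is joined on both variables, the row for a2 is joined on D only, and the row for a3 is combined with every row of S. The nested loop is chosen for readability; it does not reflect how Pig executes the individual joins.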
5.3.3 Optimizations
The optimization of SPARQL queries is a subject of current research [17-19]. As we
will demonstrate in the evaluation, optimizing SPARQL query execution based
on Pig Latin means reducing the I/O required to transfer data between the map and
reduce phases as well as the amount of data that is read from or stored in the
distributed file system.
1. SPARQL algebra. We investigated some well-known optimization strategies
for the SPARQL algebra to reduce the amount of intermediate results,
especially the early execution of filters and the reordering of triple patterns
by selectivity [19]. We used a fixed scheme without statistical information
on the RDF data set (called variable counting) where triple patterns with
one variable are considered to be more selective than triple patterns with
two variables, and bound subjects are considered to be more selective
than bound predicates or objects.
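The variable counting scheme can be read as a sort key over triple patterns. The Python sketch below shows one possible reading, assuming that patterns are (subject, predicate, object) tuples and that variables start with "?". The function names, the prefixes, and the sample patterns are hypothetical, and the sketch ignores further concerns such as avoiding cross products between patterns that share no variable.

```python
def is_variable(term):
    """Treat any term starting with '?' as a SPARQL variable (assumption)."""
    return term.startswith("?")


def selectivity_key(pattern):
    """Smaller keys are considered more selective.

    First criterion: fewer variables. Second criterion: a bound subject
    ranks before a pattern whose subject is a variable.
    """
    subject, predicate, obj = pattern
    num_vars = sum(is_variable(t) for t in (subject, predicate, obj))
    subject_unbound = 1 if is_variable(subject) else 0
    return (num_vars, subject_unbound)


def reorder_by_variable_counting(patterns):
    """Return the patterns ordered from most to least selective."""
    return sorted(patterns, key=selectivity_key)


patterns = [
    ("?x", "rdf:type", "?t"),
    ("?x", "foaf:name", '"Alice"'),
    ("ex:bob", "foaf:knows", "?x"),
    ("ex:bob", "?p", "?o"),
]
ordered = reorder_by_variable_counting(patterns)
```

Because the scheme uses no statistics, the ordering depends only on the shape of each pattern, so it can be computed before any data is loaded.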
2. Translation. Projecting away redundant data early (“project early and
often,” e.g., duplicate columns after joins or bound values that should
not occur in the result) as well as applying multijoins to reduce the
number of joins in Pig Latin has proven to be very effective. We can use a
multijoin if several consecutive joins refer to the same variables. Assume
we have three bags (A, B, C) to join by the common variable ?v. Instead of
using two joins, we can use a single multijoin as shown in the following
example.
JOIN A BY v, B BY v, C BY v;
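The effect of the multijoin can be illustrated in Python: all bags are grouped by the join key in one pass, instead of materializing the intermediate result of a first binary join and grouping it again. This is a minimal sketch under the assumption that rows are dictionaries sharing the key v. The function name multijoin and the sample bags are hypothetical and say nothing about how Pig implements the operator.

```python
from collections import defaultdict
from itertools import product


def multijoin(bags, key):
    """Inner-join any number of bags on `key` with a single grouping pass."""
    # For every key value, keep one list of matching rows per input bag.
    groups = defaultdict(lambda: [[] for _ in bags])
    for i, bag in enumerate(bags):
        for row in bag:
            groups[row[key]][i].append(row)

    result = []
    for rows_per_bag in groups.values():
        # product() yields nothing if any bag has no row for this key value,
        # which gives inner-join behaviour.
        for combo in product(*rows_per_bag):
            merged = {}
            for row in combo:
                merged.update(row)
            result.append(merged)
    return result


A = [{"v": 1, "a": "a1"}, {"v": 2, "a": "a2"}]
B = [{"v": 1, "b": "b1"}, {"v": 1, "b": "b2"}, {"v": 2, "b": "b3"}]
C = [{"v": 1, "c": "c1"}, {"v": 3, "c": "c3"}]

single_pass = multijoin([A, B, C], "v")
# Equivalent cascade of two binary joins, which builds an intermediate bag.
cascaded = multijoin([multijoin([A, B], "v"), C], "v")
```

Both variants produce the same rows; the single-pass variant avoids writing and re-reading the intermediate bag, which is the kind of I/O saving this optimization aims at.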
3. Data model. In a typical SPARQL query the predicate of a triple pattern
is mostly bound, that is, variables are typically used in the subject and
object position. Therefore, a vertical partitioning [20] of the RDF data by
predicates reduces the number of RDF triples that must be loaded for query