Large-Scale RDF Processing with MapReduce - Large Scale and Big Data: Processing and Management


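-- Tuples of R with a bound B go to R1, tuples with an unbound (null) B go to R2.
SPLIT R INTO R1 IF B is not null,
             R2 IF B is null;
-- Bound tuples: regular join on B.
j1 = JOIN R1 BY B, S BY B;
j1 = FOREACH j1 GENERATE A, R1::B AS B, C;
-- Unbound tuples are compatible with every tuple of S: cross product, B taken from S.
j2 = CROSS R2, S;
j2 = FOREACH j2 GENERATE A, S::B AS B, C;
J = UNION j1, j2;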

The complexity grows with the number of join variables that may be unbound. For two possibly unbound join variables, for example, the bag already has to be split into four distinct partitions, one for every combination of bound and unbound. Our translator recognizes whether a join contains possibly unbound variables and adapts the translation automatically. Fortunately, this situation does not occur in most SPARQL queries. In fact, if a SPARQL query is well designed according to [12], there are no joins over unbound variables at all.
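The partitioning idea can be illustrated outside Pig Latin. The following Python sketch is not the translator itself; it only mimics the semantics under stated assumptions: tuples are plain dictionaries, `None` stands for an unbound variable, only the left input may contain unbound join variables (as with R in the example above), and the function name `null_aware_join` is hypothetical. A nested loop is used for brevity, so this says nothing about how Pig executes the join.

```python
from itertools import product


def null_aware_join(left, right, join_vars):
    """Join two lists of dict-tuples; None in `left` marks an unbound variable.

    Assumption: only `left` may contain unbound join variables and the
    non-join attributes of both inputs are disjoint.
    """
    # Split `left` into one partition per combination of bound/unbound
    # join variables (up to 2**k partitions for k join variables).
    partitions = {}
    for row in left:
        bound = tuple(v for v in join_vars if row[v] is not None)
        partitions.setdefault(bound, []).append(row)

    result = []
    for bound, rows in partitions.items():
        for l, r in product(rows, right):
            # Join only on the variables bound in this partition; with no
            # bound variable this degenerates to a cross product.
            if all(l[v] == r[v] for v in bound):
                merged = dict(l)
                merged.update(r)  # unbound variables take the value from `right`
                result.append(merged)
    return result


R = [{"A": 1, "B": "x"}, {"A": 2, "B": None}]
S = [{"B": "x", "C": 10}, {"B": "y", "C": 20}]
J = null_aware_join(R, S, ["B"])
```

With one join variable the two partitions correspond to R1 and R2 above; with two join variables the dictionary `partitions` can hold up to four entries, one per combination.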

5.3.3 Optimizations

The optimization of SPARQL queries is a subject of current research [17-19]. As we will demonstrate in the evaluation, optimizing SPARQL query execution on top of Pig Latin means reducing the I/O required to transfer data between the map and reduce phases, as well as the amount of data that is read from or stored in the distributed file system.

1. SPARQL algebra. We investigated some well-known optimization strategies for the SPARQL algebra that reduce the amount of intermediate results, in particular the early execution of filters and the reordering of triple patterns by selectivity [19]. We used a fixed scheme that needs no statistical information on the RDF data set (called variable counting), in which triple patterns with one variable are considered more selective than triple patterns with two variables, and bound subjects are considered more selective than bound predicates or objects.
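As a rough illustration of variable counting, the Python sketch below orders triple patterns with a sort key. It is an assumption-laden reading of the two rules above, not the translator's actual code: patterns are (subject, predicate, object) string triples, variables start with "?", fewer variables rank first, and a bound subject is used as the tie-breaker. The helper names `is_var`, `selectivity_key`, and `reorder` are hypothetical, and the sketch ignores other concerns such as avoiding cross products between consecutive patterns.

```python
def is_var(term):
    """A term is a variable if it starts with '?' (assumed notation)."""
    return term.startswith("?")


def selectivity_key(pattern):
    """Smaller key = assumed to be more selective."""
    subject, _predicate, _obj = pattern
    n_vars = sum(is_var(t) for t in pattern)
    # Tie-breaker (assumption): a bound subject ranks before a bound
    # predicate or object when the number of variables is equal.
    subject_rank = 1 if is_var(subject) else 0
    return (n_vars, subject_rank)


def reorder(patterns):
    """Order triple patterns from most to least selective (stable sort)."""
    return sorted(patterns, key=selectivity_key)


patterns = [
    ("?x", "rdf:type", "?t"),         # two variables
    ("?x", "foaf:name", '"Alice"'),   # one variable, bound object
    ("ex:bob", "foaf:knows", "?x"),   # one variable, bound subject
]
ordered = reorder(patterns)
```

Here the pattern with the bound subject comes first, followed by the other one-variable pattern, and the two-variable pattern comes last.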

2. Translation. Projecting out redundant data as early as possible (“project early and often,” e.g., duplicate columns after joins or bound values that should not occur in the result) and applying multijoins to reduce the number of joins in Pig Latin have proven to be very effective. We can use a multijoin if several consecutive joins refer to the same variables. Assume we have three bags (A, B, C) to join on the common variable ?v. Instead of using two joins, we can use a single multijoin as shown in the following example.

JOIN A BY v, B BY v, C BY v;

3. Data model. In a typical SPARQL query, the predicate of a triple pattern is usually bound, that is, variables typically occur in the subject and object positions. Therefore, a vertical partitioning [20] of the RDF data by predicate reduces the number of RDF triples that must be loaded for query
