Large-Scale RDF Processing with MapReduce - Large Scale and Big Data: Processing and Management

applied to the default graph. The Graph operator can be used to apply a pattern to one or all of the named graphs. A named graph is referenced by a unique URI, and for each graph that is used in the query, we need a pair (URI, graph) that specifies where to find the corresponding RDF graph. If a variable is used in the Graph operator instead of a specific graph URI, the pattern must be applied to all named graphs.

As we want to execute SPARQL queries on large RDF graphs in a MapReduce cluster, all graphs must be stored in the distributed file system. Applying a pattern to one of the named graphs with Pig Latin simply means loading the corresponding data.

P6. Persons in Graph graphURI Who Know Somebody

SP
Graph(graphURI, BGP(?a knows ?b))

PL
graph1 = LOAD 'pathToGraphURI' USING RDFLoader() AS (s,p,o);
t1 = FILTER graph1 BY p == 'knows';
P6 = FOREACH t1 GENERATE s AS a, o AS b;
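P6 covers the case of a fixed graph URI. The variable case, where the pattern has to be applied to every named graph, can be modeled in a few lines of plain Python. This is only a sketch of the semantics, not the chapter's Pig Latin translation: the dictionary `NAMED_GRAPHS`, the example URIs, the sample triples, and both function names are hypothetical, and an in-memory dictionary stands in for the (URI, graph) pairs that would point to files in the distributed file system.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical named graphs: URI -> list of (s, p, o) triples.
# In the chapter's setting each graph would be a file in the distributed file system.
NAMED_GRAPHS: Dict[str, List[Triple]] = {
    "http://example.org/g1": [("alice", "knows", "bob"), ("alice", "age", "30")],
    "http://example.org/g2": [("carol", "knows", "dave")],
}


def knows_in_graph(graph_uri: str) -> List[Tuple[str, str]]:
    """Graph(graphURI, BGP(?a knows ?b)): evaluate the pattern on one named graph."""
    return [(s, o) for (s, p, o) in NAMED_GRAPHS[graph_uri] if p == "knows"]


def knows_in_all_graphs() -> List[Tuple[str, str, str]]:
    """Graph(?g, BGP(?a knows ?b)): evaluate on every named graph and bind ?g to its URI."""
    return [(uri, a, b) for uri in NAMED_GRAPHS for (a, b) in knows_in_graph(uri)]


if __name__ == "__main__":
    print(knows_in_graph("http://example.org/g1"))
    print(knows_in_all_graphs())
```

With a variable in place of the URI, each result additionally carries the URI of the graph it came from, which is why the second function returns triples (g, a, b) rather than pairs.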

Joins and Null values. As we use flat bags to represent solution mappings in Pig Latin and all tuples of a bag have the same schema, we use null values to indicate that a variable is unbound in a solution mapping. This typically occurs when using OPTIONAL to add additional information to a solution mapping. The result of OPTIONAL is a set of solution mappings (i.e., a bag in Pig Latin) where the optional variables can be unbound for some solution mappings (i.e., some tuples of the bag contain null values). However, this is problematic if the further processing of the query requires a join over these possibly unbound variables. In SPARQL, an unbound variable is compatible with any other binding of that variable, but since Pig Latin follows the relational algebra, a JOIN in Pig Latin is null rejecting. Assume we have two bags of solution mappings R, S with schemas (A,B) and (B,C), where R can contain null values for variable B, as illustrated in the following example.

R              S              R ⋈ SPARQL S
A    B         B    C         A    B    C
a1   b1        b1   c1        a1   b1   c1
a2   null                     a2   b1   c1

The second tuple of R is compatible with any tuple of S since variable B is unbound. In Pig Latin, we would only get one tuple as the join result since the second tuple of R will not match any tuple of S. To get the same result in Pig Latin, we split R into two bags (with and without null values) and process them separately, that is, we perform an equi-join for all tuples without null values and a cross product for the tuples with null values.
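The split-and-combine strategy can be checked with a small Python model. This is a sketch of the semantics under stated assumptions, not Pig Latin and not the authors' implementation: bags are modeled as lists of tuples, `None` stands in for a Pig Latin null, the sample values in `R` and `S` are illustrative, and the function names are hypothetical. It only handles unbound values of B in R, which is the case discussed above.

```python
from typing import List, Optional, Tuple

# Illustrative sample data: R has schema (A, B), S has schema (B, C).
# None plays the role of a null, i.e., an unbound variable.
R: List[Tuple[str, Optional[str]]] = [("a1", "b1"), ("a2", None)]
S: List[Tuple[str, str]] = [("b1", "c1")]


def null_rejecting_join(r, s):
    """Relational equi-join on B: tuples of r whose B is None never match."""
    return [(a, b, c)
            for (a, b) in r if b is not None
            for (b2, c) in s if b == b2]


def sparql_compatible_join(r, s):
    """Split r on whether B is bound, then combine an equi-join and a cross product."""
    bound = [t for t in r if t[1] is not None]
    unbound = [t for t in r if t[1] is None]
    joined = null_rejecting_join(bound, s)
    # An unbound B is compatible with every tuple of s; s supplies the binding for B.
    crossed = [(a, b2, c) for (a, _) in unbound for (b2, c) in s]
    return joined + crossed


if __name__ == "__main__":
    print(null_rejecting_join(R, S))
    print(sparql_compatible_join(R, S))
```

On this sample data, the null-rejecting join returns only the match for the first tuple of R, while the split version also pairs the second tuple with every tuple of S. The cross product grows with the size of S, so how many tuples carry nulls matters for cost.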
