Large-Scale RDF Processing with MapReduce - Large Scale and Big Data: Processing and Management

• Map: Collection of data items where each item can be looked up by an associated key.

[ 'name' → 'John', 'knows' → { ('Sarah'), ('Bob') } ]
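The nesting in such a map value can be mimicked in a general-purpose language. The following is only an illustrative sketch: it assumes that a Python dict, list, and tuple are acceptable stand-ins for a Pig Latin map, bag, and tuple, and the variable names are hypothetical.

```python
# Python stand-ins for Pig Latin's nested data model (an illustration, not Pig itself):
#   tuple -> Python tuple, bag -> list of tuples, map -> dict
person = {
    "name": "John",                    # key 'name' maps to an atomic value
    "knows": [("Sarah",), ("Bob",)],   # key 'knows' maps to a bag of 1-tuples
}

# Look up items by their associated key.
name = person["name"]
friends = [t[0] for t in person["knows"]]
print(name, friends)
```

The point of the sketch is that a map value may itself be a bag, so the data model is nested rather than flat.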

Operators: A Pig Latin program consists of a sequence of instructions, each of which performs a single data transformation. We briefly introduce the Pig Latin operators that we used for our translation. The interested reader can find a more detailed description of Pig Latin in [16].

• LOAD deserializes the input data and maps it to the data model of Pig Latin. The user can implement a User Defined Function (UDF) that defines how to map an input tuple to a Pig Latin tuple, as shown in the following example. The result of LOAD is a bag of tuples.

people = LOAD 'input' USING myLoad() AS (name, age);

• FOREACH applies some processing to every tuple of a bag. It can also be used for projection or for adding new fields to a tuple.

A = FOREACH people GENERATE name, (age >= 18 ? 'adult' : 'minor') AS type;

• FILTER removes unwanted tuples from a bag.

B = FILTER people BY age >= 18;

• [OUTER] JOIN performs an equi-join or an outer join between bags. It can also be applied to more than two bags at once (multijoin).

C = JOIN A BY name [LEFT OUTER], B BY name;

• UNION combines two or more bags. Unlike in relational databases, the schemas of the tuples do not have to match, although this is not recommended in general because the schema information, especially the alias names of the fields, is lost in such cases.

D = UNION B, C;

• SPLIT partitions a bag into two or more bags that do not have to be disjoint or complete; that is, a tuple can end up in more than one partition or in no partition at all.

SPLIT people INTO E IF age < 18, F IF age >= 21;
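To make the semantics of these operators concrete, the following sketch mimics them in plain Python. It is an approximation under stated assumptions, not how Pig executes them: bags are modeled as lists, tuples as dicts keyed by field name, the sample data and the helper left_outer_join are hypothetical, and nothing here runs on MapReduce.

```python
# Bags are modeled as lists of dicts (field name -> value). This mimics the
# semantics of the operators only, not Pig's execution on MapReduce.
people = [
    {"name": "Ann", "age": 17},
    {"name": "Bob", "age": 19},
    {"name": "Cid", "age": 30},
]

# FOREACH ... GENERATE: project a field and add a computed one.
A = [{"name": p["name"], "type": "adult" if p["age"] >= 18 else "minor"}
     for p in people]

# FILTER: keep only the tuples that satisfy the predicate.
B = [p for p in people if p["age"] >= 18]


def left_outer_join(left, right, key):
    """Equi-join on `key`; unmatched left tuples are kept, paired with None."""
    joined = []
    for lt in left:
        matches = [rt for rt in right if rt[key] == lt[key]]
        for rt in matches or [None]:
            joined.append((lt, rt))
    return joined


# JOIN ... LEFT OUTER: every tuple of A survives, even without a partner in B.
C = left_outer_join(A, B, "name")

# UNION: plain concatenation; B holds dicts and C holds pairs, so the
# result mixes differently shaped tuples and has no common schema.
D = B + C

# SPLIT: each condition is evaluated independently for every tuple.
E = [p for p in people if p["age"] < 18]
F = [p for p in people if p["age"] >= 21]

print(len(C), len(D), [p["name"] for p in E], [p["name"] for p in F])
```

With the conditions taken from the SPLIT example above, the tuple with age 19 lands in neither E nor F, which shows an incomplete partitioning. Overlapping conditions would likewise place one tuple in several partitions.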
