ensure that the subsets of L and R with the same join key values can be processed on the same machine. Hence, for joining arbitrary data sets on arbitrary keys we generally have to shuffle data over the network or choose appropriate prepartitioning and replication strategies.
The most prominent and flexible join technique in MapReduce is called reduce-side join [14,15]. Some literature also refers to it as repartition join [14] because the idea is based on reading both data sets (map phase) and repartitioning them according to the join key (shuffle phase). The actual join computation is done in the reduce phase. The main drawback of this approach is that both data sets are transferred over the network in their entirety, regardless of the join output. This is especially inefficient for selective joins and consumes a lot of network bandwidth. Another group of joins eliminates the shuffle and reduce phases to avoid transferring both data sets over the network. This kind of join technique is called map-side join since the actual join processing is done in the map phase. The most common one is the map-side merge join [15]. However, this join cannot be applied to arbitrary data sets since a preprocessing step is necessary to fulfill several requirements: both data sets have to be sorted and equally partitioned according to the join key. If these preconditions are fulfilled, the map phase can perform an efficient parallel merge join between the presorted partitions, and no data shuffling is necessary. In a sequence of such joins, however, the shuffle and reduce phases are needed to fulfill the preconditions for the next join iteration. Therefore, map-side joins are generally hard to cascade, and the advantage of avoiding a shuffle and reduce phase is lost. In Section 5.6, we present our MAPSIN join approach, which is designed to overcome this drawback by using the distributed index of the NoSQL data store HBase.
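
To make the repartition idea more concrete, the following sketch shows a reduce-side join implemented with Hadoop's Java MapReduce API. It is only a minimal illustration, not the implementation discussed above: we assume two tab-separated text inputs whose first column is the join key, and all class, path, and tag names are placeholders. The mappers tag every record with its source relation, the shuffle phase repartitions the tagged records by join key, and the reducer buffers one side per key and emits the cross product with the other side.

    // Minimal sketch of a reduce-side (repartition) join in Hadoop MapReduce.
    // Assumption: both inputs are tab-separated text files whose first column
    // is the join key; class and path names are illustrative only.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReduceSideJoin {

      // Map phase: extract the join key and tag the value with its source
      // relation ("L" or "R") so the reducer can tell the two sides apart.
      public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        private final String tag;
        protected TagMapper(String tag) { this.tag = tag; }

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t", 2);
          if (fields.length == 2) {
            context.write(new Text(fields[0]), new Text(tag + "|" + fields[1]));
          }
        }
      }

      public static class LeftMapper extends TagMapper {
        public LeftMapper() { super("L"); }
      }

      public static class RightMapper extends TagMapper {
        public RightMapper() { super("R"); }
      }

      // Reduce phase: all records with the same join key arrive at the same
      // reducer (shuffle = repartitioning); buffer both sides and emit the
      // cross product, i.e., the actual join result for this key.
      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          List<String> left = new ArrayList<>();
          List<String> right = new ArrayList<>();
          for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("L|")) left.add(s.substring(2));
            else right.add(s.substring(2));
          }
          for (String l : left) {
            for (String r : right) {
              context.write(key, new Text(l + "\t" + r));
            }
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, LeftMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, RightMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note that the tagged records of both data sets pass through the shuffle phase in full, which is exactly the network overhead discussed above; a map-side merge join avoids this shuffle, but only if both inputs are already sorted and equally partitioned on the join key.
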
5.2.3 Pig Latin
Pig Latin [8] is a language developed by Yahoo! Research for the analysis of very large data sets on top of Apache Hadoop. The implementation of Pig Latin for Hadoop, Pig, is an Apache top-level project that automatically translates a Pig Latin program into a series of MapReduce jobs.
Data Model: Pig Latin has a fully nested data model that allows more flexibility than the flat tables required by the first normal form in relational databases. The data model of Pig Latin provides four different types:
Atom: Contains a simple atomic value like a string or a number, for example, 'Sarah' or 24.
Tuple: Sequence of fields of any type. Every field can have a name (alias) that can be used to reference the field, for example, ('John', 'Doe') with alias (firstname, lastname).
Bag: Collection of tuples with possible duplicates. The schemas of the tuples do not have to match, that is, the number and types of fields can differ, for example,
{
  ('Bob', 'Sarah'),
  ('Peter', ('likes', 'football'))
}