FIGURE 5.2
MapReduce data flow (map phase, shuffle and sort phase, reduce phase).
using the GRAPH operator. For a detailed definition of the SPARQL syntax we refer
the interested reader to the official W3C Recommendation [2]. A formal definition of
the SPARQL semantics can also be found in [12]. The SPARQL query in Figure 5.1
returns all persons who know "Peter" and are at least 18 years old, together with their
mailboxes if they exist. Executed on the corresponding RDF graph, it yields two
results, for "John" and "Bob", where only "Bob" has a known email address.
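Since Figure 5.1 itself is not reproduced here, the following is only a sketch of what a query of this shape could look like; the FOAF prefix and the property names (`foaf:knows`, `foaf:age`, `foaf:mbox`) are assumptions, not the figure's actual vocabulary:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ?mbox
WHERE {
  ?person foaf:knows ?peter .
  ?peter  foaf:name  "Peter" .
  ?person foaf:age   ?age .
  FILTER (?age >= 18)
  # OPTIONAL keeps persons without a known mailbox in the result
  OPTIONAL { ?person foaf:mbox ?mbox }
}
```

The OPTIONAL clause is what allows "John" to appear in the result even though only "Bob" has a known email address.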
5.2.2 MapReduce
The MapReduce programming model [4] enables scalable, fault-tolerant, and massively
parallel computations on a computer cluster. The basis of Google's MapReduce is the
distributed file system GFS [13], where large files are split into equal-sized blocks and
spread across the cluster; fault tolerance is achieved by replication. We use Apache Hadoop *
as it is the most popular open-source implementation of Google's GFS and MapReduce
framework and is used by many companies such as Yahoo!, IBM, and Facebook.
The workflow of a MapReduce program is a sequence of MapReduce iterations
each consisting of a map and a reduce phase separated by a so-called Shuffle and Sort
phase (see Figure 5.2). A user has to implement map and reduce functions, which
are automatically executed in parallel on a subset of the data. The map function gets
invoked for every input record represented as a key-value pair. It outputs a list of new
intermediate key-value pairs that are then sorted and grouped by their key. The reduce
function gets invoked for every distinct intermediate key together with the list of all
according values and outputs a list of values that can be used as input for the next
MapReduce iteration. The signatures of the map and reduce functions are therefore
as follows:
map: (inKey, inValue) -> list(outKey, tmpValue)
reduce: (outKey, list(tmpValue)) -> list(outValue)
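As an illustration of these signatures, the following is a minimal single-process sketch of the map, shuffle-and-sort, and reduce phases, applied to the classic word-count example (the function and variable names are our own, not part of the Hadoop API):

```python
from collections import defaultdict

# map: (inKey, inValue) -> list(outKey, tmpValue)
def map_fn(key, value):
    # emit an intermediate (word, 1) pair for every word in the input record
    return [(word, 1) for word in value.split()]

# reduce: (outKey, list(tmpValue)) -> list(outValue)
def reduce_fn(key, values):
    # sum all counts collected for this distinct key
    return [sum(values)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # map phase: invoke map_fn for every input key-value pair
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # shuffle and sort: sort intermediate pairs and group values by key
    groups = defaultdict(list)
    for key, value in sorted(intermediate):
        groups[key].append(value)
    # reduce phase: invoke reduce_fn once per distinct intermediate key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

result = run_mapreduce([(0, "a b a"), (1, "b c")], map_fn, reduce_fn)
# result == {'a': [2], 'b': [2], 'c': [1]}
```

In Hadoop, the map and reduce invocations run in parallel across the cluster and the shuffle-and-sort phase moves data over the network; this sketch only mirrors the data flow of Figure 5.2.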
5.2.2.1 Map-Side vs. Reduce-Side Join
Processing joins with MapReduce is a challenging task, as the data sets are typically very
large [14,15]. If we want to join two data sets, L ⋈ R, with MapReduce, we have to
* http://hadoop.apache.org/.
http://wiki.apache.org/hadoop/PoweredBy.