FIGURE 5.2
MapReduce data flow (map phase, shuffle and sort phase, reduce phase).
using the GRAPH operator. For a detailed definition of the SPARQL syntax we refer
the interested reader to the official W3C Recommendation [2]. A formal definition of
the SPARQL semantics can also be found in [12]. The SPARQL query in Figure 5.1
returns all persons who know "Peter" and are at least 18 years old, together with their
mailboxes if they exist. Executed on the corresponding RDF graph, it yields two
results, for "John" and "Bob", where only "Bob" has a known email address.
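Since Figure 5.1 itself is not reproduced here, the following is only a sketch of what a query of this shape could look like; the FOAF prefix and the property names (`foaf:knows`, `foaf:age`, `foaf:mbox`) are assumptions, not the figure's actual vocabulary:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?person ?mbox
WHERE {
  ?person foaf:knows ?peter .
  ?peter  foaf:name  "Peter" .
  ?person foaf:age   ?age .
  FILTER (?age >= 18)
  # OPTIONAL keeps persons without a known mailbox in the result
  OPTIONAL { ?person foaf:mbox ?mbox }
}
```

The OPTIONAL clause is what allows "John" to appear in the result even though only "Bob" has a known email address.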
5.2.2 MapReduce
The MapReduce programming model [4] enables scalable, fault-tolerant, and massively
parallel computations on a computer cluster. The basis of Google's MapReduce is the
distributed file system GFS [13], where large files are split into equal-sized blocks and
spread across the cluster; fault tolerance is achieved by replication. We use Apache Hadoop *
as it is the most popular open-source implementation of Google's GFS and MapReduce
framework and is used by many companies such as Yahoo!, IBM, and Facebook.
The workflow of a MapReduce program is a sequence of MapReduce iterations
each consisting of a map and a reduce phase separated by a so-called Shuffle and Sort
phase (see Figure 5.2). A user has to implement map and reduce functions, which
are automatically executed in parallel on a subset of the data. The map function gets
invoked for every input record represented as a key-value pair. It outputs a list of new
intermediate key-value pairs that are then sorted and grouped by their key. The reduce
function gets invoked for every distinct intermediate key together with the list of all
according values and outputs a list of values that can be used as input for the next
MapReduce iteration. The signatures of the map and reduce functions are therefore
as follows:
map: (inKey, inValue) -> list(outKey, tmpValue)
reduce: (outKey, list(tmpValue)) -> list(outValue)
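As an illustration of these signatures, the following is a minimal single-process sketch of the map, shuffle-and-sort, and reduce phases, applied to the classic word-count example (the function and variable names are our own, not part of the Hadoop API):

```python
from collections import defaultdict

# map: (inKey, inValue) -> list(outKey, tmpValue)
def map_fn(key, value):
    # emit an intermediate (word, 1) pair for every word in the input record
    return [(word, 1) for word in value.split()]

# reduce: (outKey, list(tmpValue)) -> list(outValue)
def reduce_fn(key, values):
    # sum all counts collected for this distinct key
    return [sum(values)]

def run_mapreduce(inputs, map_fn, reduce_fn):
    # map phase: invoke map_fn for every input key-value pair
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # shuffle and sort: sort intermediate pairs and group values by key
    groups = defaultdict(list)
    for key, value in sorted(intermediate):
        groups[key].append(value)
    # reduce phase: invoke reduce_fn once per distinct intermediate key
    return {key: reduce_fn(key, values) for key, values in groups.items()}

result = run_mapreduce([(0, "a b a"), (1, "b c")], map_fn, reduce_fn)
# result == {'a': [2], 'b': [2], 'c': [1]}
```

In Hadoop, the map and reduce invocations run in parallel across the cluster and the shuffle-and-sort phase moves data over the network; this sketch only mirrors the data flow of Figure 5.2.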
5.2.2.1 Map-Side vs. Reduce-Side Join
Processing joins with MapReduce is a challenging task, as the data sets are typically very
large [14,15]. If we want to join two data sets, L ⋈ R, with MapReduce, we have to
* http://hadoop.apache.org/.
http://wiki.apache.org/hadoop/PoweredBy.