Replication Rate and Reducer Size: It is often convenient to measure communication by the replication rate, which is the communication per input. Also, the reducer size is the maximum number of inputs associated with any reducer. For many problems, it is possible to derive a lower bound on replication rate as a function of the reducer size.
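As a concrete, if toy, illustration, the Python sketch below (not taken from the text) computes both quantities from an assignment of inputs to reducers; the particular assignment is invented purely for illustration.

```python
# Minimal sketch: replication rate and reducer size of a toy assignment.
# The mapping of inputs to reducers below is invented for illustration.

def replication_rate(assignment):
    """assignment maps each input to the set of reducers it is sent to."""
    total_copies = sum(len(reducers) for reducers in assignment.values())
    return total_copies / len(assignment)      # communication per input

def reducer_size(assignment):
    """Maximum number of inputs associated with any single reducer."""
    counts = {}
    for reducers in assignment.values():
        for r in reducers:
            counts[r] = counts.get(r, 0) + 1
    return max(counts.values())

# Toy example: 4 inputs, each sent to 2 of 3 reducers.
toy = {"a": {1, 2}, "b": {1, 3}, "c": {2, 3}, "d": {1, 2}}
print(replication_rate(toy))   # 2.0  (8 input copies / 4 inputs)
print(reducer_size(toy))       # 3    (reducer 1 receives a, b, and d)
```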
Representing Problems as Graphs: It is possible to represent many problems that are amenable to MapReduce computation by a graph in which nodes represent inputs and outputs. An output is connected to all the inputs that are needed to compute that output.
Mapping Schemas: Given the graph of a problem, and given a reducer size, a mapping schema is an assignment of the inputs to one or more reducers so that no reducer is assigned more inputs than the reducer size permits, and yet for every output there is some reducer that gets all the inputs needed to compute that output. The requirement that there be a mapping schema for any MapReduce algorithm is a good expression of what makes MapReduce algorithms different from general parallel computations.
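The hedged sketch below ties the last two items together: the problem graph appears as a dictionary from each output to the set of inputs it needs, and a candidate schema, a dictionary from each reducer to its assigned inputs, is checked against the two defining conditions. Both the toy problem and the schema are invented for illustration.

```python
# Minimal sketch of checking a mapping schema (assumed representations:
# `outputs` maps each output to the set of inputs it needs, i.e. the problem
# graph; `schema` maps each reducer to the set of inputs assigned to it).

def is_mapping_schema(outputs, schema, q):
    """Check the two conditions that define a mapping schema of reducer size q."""
    # 1. No reducer is assigned more than q inputs.
    if any(len(inputs) > q for inputs in schema.values()):
        return False
    # 2. Every output has some reducer holding all the inputs it needs.
    return all(
        any(needed <= assigned for assigned in schema.values())
        for needed in outputs.values()
    )

# Toy problem: each output needs a pair of inputs.
outputs = {"xy": {"x", "y"}, "yz": {"y", "z"}, "xz": {"x", "z"}}
schema  = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"x", "z"}}
print(is_mapping_schema(outputs, schema, q=2))   # True
print(is_mapping_schema(outputs, schema, q=1))   # False: every reducer exceeds q
```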
Matrix Multiplication by MapReduce: There is a family of one-pass MapReduce algorithms that performs multiplication of n × n matrices with the minimum possible replication rate r = 2n²/q, where q is the reducer size. On the other hand, a two-pass MapReduce algorithm for the same problem with the same reducer size can use up to a factor of n less communication.
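To see where r = 2n²/q comes from in the one-pass case, consider the standard grouping in which the rows of the first matrix and the columns of the second are each split into g bands, with one reducer for each pair of bands. The sketch below, written here for illustration and not code from the text, simply evaluates the resulting counts: each reducer holds q = 2n²/g inputs, each matrix element is sent to g reducers, and therefore r = g = 2n²/q.

```python
# Minimal sketch of the one-pass banding that yields r = 2*n**2/q.

def one_pass_stats(n, g):
    """Return (reducer size q, replication rate r) when rows/columns
    of the two n-by-n matrices are split into g bands each."""
    assert n % g == 0
    band = n // g
    q = 2 * band * n   # each reducer holds `band` rows of M and `band` columns of N
    r = g              # each matrix element is paired with all g opposite bands
    return q, r

n, g = 12, 4
q, r = one_pass_stats(n, g)
print(q, r)                   # 72 4
print(r == 2 * n**2 / q)      # True: r = 2n^2/q
```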
2.8 References for Chapter 2
GFS, the Google File System, was described in [10]. The paper on Google's MapReduce is [8]. Information about Hadoop and HDFS can be found at [11]. More detail on relations and relational algebra can be found in [16].
Clustera is covered in [9]. Hyracks (previously called Hyrax) is from [4]. The Dryad system [13] has similar capabilities, but requires user creation of parallel tasks. That responsibility was automated through the introduction of DryadLINQ [17]. For a discussion of cluster implementation of recursion, see [1]. Pregel is from [14].
A different approach to recursion was taken in HaLoop [5]. There, recursion is treated as iteration, with the output of one round serving as the input to the next round. Efficiency is obtained by managing the location of the intermediate data and the tasks that implement each round.
There are a number of other systems built on a distributed file system and/or MapReduce, which have not been covered here, but may be worth knowing about. [6] describes BigTable, a Google implementation of an object store of very large size. A somewhat different direction was taken at Yahoo! with PNUTS [7], which supports, for example, a limited form of transaction processing.
PIG [15] is an implementation of relational algebra on top of Hadoop. Similarly, Hive [12] implements a restricted form of SQL on top of Hadoop.
The communication-cost model for MapReduce algorithms and the optimal implementations of multiway joins are from [3]. The material on replication rate, reducer size, and their relationship is from [2]. Solutions to Exercises 2.6.2 and 2.6.3 can be found there.
[1] F.N. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J.D. Ullman, "Cluster computing, recursion, and Datalog," to appear in Proc. Datalog 2.0 Workshop, Elsevier, 2011.