Replication Rate and Reducer Size: It is often convenient to measure communication by the replication rate, which is the communication per input. Also, the reducer size is the maximum number of inputs associated with any reducer. For many problems, it is possible to derive a lower bound on replication rate as a function of the reducer size.
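As a concrete, if toy, illustration, the Python sketch below (not taken from the text) computes both quantities from an assignment of inputs to reducers; the particular assignment is invented purely for illustration.

```python
# Minimal sketch: replication rate and reducer size of a toy assignment.
# The mapping of inputs to reducers below is invented for illustration.

def replication_rate(assignment):
    """assignment maps each input to the set of reducers it is sent to."""
    total_copies = sum(len(reducers) for reducers in assignment.values())
    return total_copies / len(assignment)      # communication per input

def reducer_size(assignment):
    """Maximum number of inputs associated with any single reducer."""
    counts = {}
    for reducers in assignment.values():
        for r in reducers:
            counts[r] = counts.get(r, 0) + 1
    return max(counts.values())

# Toy example: 4 inputs, each sent to 2 of 3 reducers.
toy = {"a": {1, 2}, "b": {1, 3}, "c": {2, 3}, "d": {1, 2}}
print(replication_rate(toy))   # 2.0  (8 input copies / 4 inputs)
print(reducer_size(toy))       # 3    (reducer 1 receives a, b, and d)
```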
Representing Problems as Graphs: It is possible to represent many problems that are amenable to MapReduce computation by a graph in which nodes represent inputs and outputs. An output is connected to all the inputs that are needed to compute that output.
Mapping Schemas: Given the graph of a problem, and given a reducer size, a mapping schema is an assignment of the inputs to one or more reducers so that no reducer is assigned more inputs than the reducer size permits, and yet for every output there is some reducer that gets all the inputs needed to compute that output. The requirement that there be a mapping schema for any MapReduce algorithm is a good expression of what makes MapReduce algorithms different from general parallel computations.
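The hedged sketch below ties the last two items together: the problem graph appears as a dictionary from each output to the set of inputs it needs, and a candidate schema, a dictionary from each reducer to its assigned inputs, is checked against the two defining conditions. Both the toy problem and the schema are invented for illustration.

```python
# Minimal sketch of checking a mapping schema (assumed representations:
# `outputs` maps each output to the set of inputs it needs, i.e. the problem
# graph; `schema` maps each reducer to the set of inputs assigned to it).

def is_mapping_schema(outputs, schema, q):
    """Check the two conditions that define a mapping schema of reducer size q."""
    # 1. No reducer is assigned more than q inputs.
    if any(len(inputs) > q for inputs in schema.values()):
        return False
    # 2. Every output has some reducer holding all the inputs it needs.
    return all(
        any(needed <= assigned for assigned in schema.values())
        for needed in outputs.values()
    )

# Toy problem: each output needs a pair of inputs.
outputs = {"xy": {"x", "y"}, "yz": {"y", "z"}, "xz": {"x", "z"}}
schema  = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"x", "z"}}
print(is_mapping_schema(outputs, schema, q=2))   # True
print(is_mapping_schema(outputs, schema, q=1))   # False: every reducer exceeds q
```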
Matrix Multiplication by MapReduce: There is a family of one-pass MapReduce algorithms that performs multiplication of n × n matrices with the minimum possible replication rate r = 2n²/q, where q is the reducer size. On the other hand, a two-pass MapReduce algorithm for the same problem with the same reducer size can use up to a factor of n less communication.
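To see where r = 2n²/q comes from in the one-pass case, consider the standard grouping in which the rows of the first matrix and the columns of the second are each split into g bands, with one reducer for each pair of bands. The sketch below, written here for illustration and not code from the text, simply evaluates the resulting counts: each reducer holds q = 2n²/g inputs, each matrix element is sent to g reducers, and therefore r = g = 2n²/q.

```python
# Minimal sketch of the one-pass banding that yields r = 2*n**2/q.

def one_pass_stats(n, g):
    """Return (reducer size q, replication rate r) when rows/columns
    of the two n-by-n matrices are split into g bands each."""
    assert n % g == 0
    band = n // g
    q = 2 * band * n   # each reducer holds `band` rows of M and `band` columns of N
    r = g              # each matrix element is paired with all g opposite bands
    return q, r

n, g = 12, 4
q, r = one_pass_stats(n, g)
print(q, r)                   # 72 4
print(r == 2 * n**2 / q)      # True: r = 2n^2/q
```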
2.8 References for Chapter 2
GFS, the Google File System, was described in [10]. The paper on Google's MapReduce is [8]. Information about Hadoop and HDFS can be found at [11]. More detail on relations and relational algebra can be found in [16].
Clustera is covered in [9]. Hyracks (previously called Hyrax) is from [4]. The Dryad system [13] has similar capabilities, but requires user creation of parallel tasks. That responsibility was automated through the introduction of DryadLINQ [17]. For a discussion of cluster implementation of recursion, see [1]. Pregel is from [14].
A different approach to recursion was taken in HaLoop [5]. There, recursion is treated as iteration, with the output of one round serving as the input to the next round. Efficiency is obtained by managing the location of the intermediate data and the tasks that implement each round.
There are a number of other systems built on a distributed file system and/or MapReduce, which have not been covered here, but may be worth knowing about. [6] describes BigTable, a Google implementation of an object store of very large size. A somewhat different direction was taken at Yahoo! with PNUTS [7], which supports, for example, a limited form of transaction processing.
PIG [15] is an implementation of relational algebra on top of Hadoop. Similarly, Hive [12] implements a restricted form of SQL on top of Hadoop.
The communication-cost model for MapReduce algorithms and the optimal implementations of multiway joins are from [3]. The material on replication rate, reducer size, and their relationship is from [2]. Solutions to Exercises 2.6.2 and 2.6.3 can be found there.
[1] F.N. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J.D. Ullman, "Cluster computing, recursion, and Datalog," to appear in Proc. Datalog 2.0 Workshop, Elsevier, 2011.