2.3 Algorithms Using MapReduce
MapReduce is not a solution to every problem, not even every problem that profitably can use many compute nodes operating in parallel. As we mentioned in Section 2.1.2, the entire distributed-file-system milieu makes sense only when files are very large and are rarely updated in place. Thus, we would not expect to use either a DFS or an implementation of MapReduce for managing on-line retail sales, even though a large on-line retailer such as Amazon.com uses thousands of compute nodes when processing requests over the Web. The reason is that the principal operations on Amazon data involve responding to searches for products, recording sales, and so on, processes that involve relatively little calculation and that change the database.2 On the other hand, Amazon might use MapReduce to perform certain analytic queries on large amounts of data, such as finding for each user those users whose buying patterns were most similar.
The original purpose for which the Google implementation of MapReduce was created was to execute very large matrix-vector multiplications as are needed in the calculation of PageRank (see Chapter 5). We shall see that matrix-vector and matrix-matrix calculations fit nicely into the MapReduce style of computing. Another important class of operations that can use MapReduce effectively is the relational-algebra operations. We shall examine the MapReduce execution of these operations as well.
2.3.1 Matrix-Vector Multiplication by MapReduce
Suppose we have an n × n matrix M, whose element in row i and column j will be denoted m_ij. Suppose we also have a vector v of length n, whose jth element is v_j. Then the matrix-vector product is the vector x of length n, whose ith element x_i is given by

x_i = Σ_{j=1}^{n} m_ij v_j
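For instance (with arbitrarily chosen numbers), if M is the 2 × 2 matrix whose rows are (1, 2) and (3, 4), and v = (5, 6), then x_1 = 1·5 + 2·6 = 17 and x_2 = 3·5 + 4·6 = 39.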
If n = 100, we do not want to use a DFS or MapReduce for this calculation. But this sort
of calculation is at the heart of the ranking of Web pages that goes on at search engines,
and there, n is in the tens of billions. 3 Let us first assume that n is large, but not so large
that vector v cannot fit in main memory and thus be available to every Map task.
The matrix M and the vector v each will be stored in a file of the DFS. We assume that the row-column coordinates of each matrix element will be discoverable, either from its position in the file, or because it is stored with explicit coordinates, as a triple (i, j, m_ij). We also assume the position of element v_j in the vector v will be discoverable in the analogous way.
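As a rough, single-process illustration of how such a computation can be organized in the MapReduce style, here is a minimal sketch in Python. It is only a sketch under the assumptions above (v fits in memory, M is presented as (i, j, m_ij) triples); the function names and the tiny 2 × 2 example are hypothetical and not tied to any particular MapReduce implementation. The Map stage turns each triple (i, j, m_ij) into the key-value pair (i, m_ij * v_j), and the Reduce stage sums all values associated with a given key i to produce x_i.

from collections import defaultdict

def map_task(matrix_triples, v):
    # For each matrix element m_ij, emit the key-value pair (i, m_ij * v_j).
    for i, j, m_ij in matrix_triples:
        yield (i, m_ij * v[j])

def reduce_task(pairs):
    # For each key i, sum all the values m_ij * v_j to obtain x_i.
    x = defaultdict(float)
    for i, term in pairs:
        x[i] += term
    return dict(x)

# Tiny 1-indexed example: M has rows (1, 2) and (3, 4), stored as triples; v = (5, 6).
M = [(1, 1, 1.0), (1, 2, 2.0), (2, 1, 3.0), (2, 2, 4.0)]
v = {1: 5.0, 2: 6.0}
print(reduce_task(map_task(M, v)))   # {1: 17.0, 2: 39.0}

In a real MapReduce execution the triples of M would be split across many Map tasks, each holding its own copy of v, and the grouping by key i would be done by the underlying framework rather than by the single dictionary used here.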