2.3 Algorithms Using MapReduce
MapReduce is not a solution to every problem, not even every problem that profitably can use many compute nodes operating in parallel. As we mentioned in Section 2.1.2, the entire distributed-file-system milieu makes sense only when files are very large and are rarely updated in place. Thus, we would not expect to use either a DFS or an implementation of MapReduce for managing on-line retail sales, even though a large on-line retailer such as Amazon.com uses thousands of compute nodes when processing requests over the Web. The reason is that the principal operations on Amazon data involve responding to searches for products, recording sales, and so on, processes that involve relatively little calculation and that change the database.2 On the other hand, Amazon might use MapReduce to perform certain analytic queries on large amounts of data, such as finding for each user those users whose buying patterns were most similar.
The original purpose for which the Google implementation of MapReduce was created was to execute very large matrix-vector multiplications as are needed in the calculation of PageRank (see Chapter 5). We shall see that matrix-vector and matrix-matrix calculations fit nicely into the MapReduce style of computing. Another important class of operations that can use MapReduce effectively is the relational-algebra operations. We shall examine the MapReduce execution of these operations as well.
2.3.1 Matrix-Vector Multiplication by MapReduce
Suppose we have an n × n matrix M, whose element in row i and column j will be denoted m_ij. Suppose we also have a vector v of length n, whose jth element is v_j. Then the matrix-vector product is the vector x of length n, whose ith element x_i is given by

x_i = Σ_{j=1}^{n} m_ij v_j
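For instance (with arbitrarily chosen numbers), if M is the 2 × 2 matrix whose rows are (1, 2) and (3, 4), and v = (5, 6), then x_1 = 1·5 + 2·6 = 17 and x_2 = 3·5 + 4·6 = 39.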
If n = 100, we do not want to use a DFS or MapReduce for this calculation. But this sort
of calculation is at the heart of the ranking of Web pages that goes on at search engines,
and there, n is in the tens of billions. 3 Let us first assume that n is large, but not so large
that vector v cannot fit in main memory and thus be available to every Map task.
The matrix M and the vector v each will be stored in a file of the DFS. We assume that the row-column coordinates of each matrix element will be discoverable, either from its position in the file, or because it is stored with explicit coordinates, as a triple (i, j, m_ij). We also assume the position of element v_j in the vector v will be discoverable in the analogous way.
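As a rough, single-process illustration of how such a computation can be organized in the MapReduce style, here is a minimal sketch in Python. It is only a sketch under the assumptions above (v fits in memory, M is presented as (i, j, m_ij) triples); the function names and the tiny 2 × 2 example are hypothetical and not tied to any particular MapReduce implementation. The Map stage turns each triple (i, j, m_ij) into the key-value pair (i, m_ij * v_j), and the Reduce stage sums all values associated with a given key i to produce x_i.

from collections import defaultdict

def map_task(matrix_triples, v):
    # For each matrix element m_ij, emit the key-value pair (i, m_ij * v_j).
    for i, j, m_ij in matrix_triples:
        yield (i, m_ij * v[j])

def reduce_task(pairs):
    # For each key i, sum all the values m_ij * v_j to obtain x_i.
    x = defaultdict(float)
    for i, term in pairs:
        x[i] += term
    return dict(x)

# Tiny 1-indexed example: M has rows (1, 2) and (3, 4), stored as triples; v = (5, 6).
M = [(1, 1, 1.0), (1, 2, 2.0), (2, 1, 3.0), (2, 2, 4.0)]
v = {1: 5.0, 2: 6.0}
print(reduce_task(map_task(M, v)))   # {1: 17.0, 2: 39.0}

In a real MapReduce execution the triples of M would be split across many Map tasks, each holding its own copy of v, and the grouping by key i would be done by the underlying framework rather than by the single dictionary used here.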