element x_i is given by

    x_i = Σ_{j=1}^{n} m_ij v_j
If n = 100, we do not want to use a DFS or map-reduce for this calculation.
But this sort of calculation is at the heart of the ranking of Web pages that goes
on at search engines, and there, n is in the tens of billions. 3 Let us first assume
that n is large, but not so large that vector v cannot fit in main memory, and
be part of the input to every Map task. It is useful to observe at this time that
there is nothing in the definition of map-reduce that forbids providing the same
input to more than one Map task.
The matrix M and the vector v each will be stored in a file of the DFS. We
assume that the row-column coordinates of each matrix element will be discov-
erable, either from its position in the file, or because it is stored with explicit
coordinates, as a triple (i, j, m_ij). We also assume the position of element v_j
in the vector v will be discoverable in the analogous way.
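As a concrete illustration, the triple representation might look like the following sketch. The exact DFS file layout and field order are not specified by the text, so the Python data structures and the specific values here are our own assumptions:

```python
# Hypothetical sketch of the triple representation. A sparse 3x3 matrix M
# is stored as (i, j, m_ij) triples; zero entries are simply omitted,
# which is what makes this format attractive for sparse matrices.
M_triples = [
    (0, 0, 2.0), (0, 2, 1.0),
    (1, 1, 3.0),
    (2, 0, 4.0), (2, 2, 5.0),
]

# The vector v stored analogously, as (j, v_j) pairs.
v_pairs = [(0, 1.0), (1, 2.0), (2, 3.0)]
```

Note that omitting zero entries costs nothing: a missing triple (i, j, m_ij) contributes nothing to the sum for x_i.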
The Map Function: Each Map task will take the entire vector v and a chunk
of the matrix M. From each matrix element m_ij it produces the key-value pair
(i, m_ij v_j). Thus, all terms of the sum that make up the component x_i of the
matrix-vector product will get the same key.
The Reduce Function: A Reduce task has simply to sum all the values
associated with a given key i. The result will be a pair (i, x_i).
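The two functions can be sketched in Python as follows. This is a minimal single-machine sketch that assumes v fits in memory and ignores the DFS, chunking, and task scheduling; the function names are our own:

```python
from collections import defaultdict

def map_fn(matrix_chunk, v):
    """For each matrix element m_ij, emit the key-value pair (i, m_ij * v_j).

    matrix_chunk: a list of (i, j, m_ij) triples; v: the full vector as a list.
    """
    for i, j, m_ij in matrix_chunk:
        yield (i, m_ij * v[j])

def reduce_fn(key, values):
    """Sum all values associated with key i, producing the pair (i, x_i)."""
    return (key, sum(values))

def matrix_vector_product(triples, v):
    # Group the Map output by key, as the map-reduce framework would,
    # then apply the Reduce function to each group.
    groups = defaultdict(list)
    for key, value in map_fn(triples, v):
        groups[key].append(value)
    return dict(reduce_fn(k, vals) for k, vals in groups.items())
```

For example, the matrix [[1, 2], [3, 4]] as triples [(0,0,1), (0,1,2), (1,0,3), (1,1,4)] multiplied by v = [1, 1] yields x = (3, 7).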
2.3.2 If the Vector v Cannot Fit in Main Memory
However, it is possible that the vector v is so large that it will not fit in its
entirety in main memory. We don't have to fit it in main memory at a compute
node, but if we do not then there will be a very large number of disk accesses
as we move pieces of the vector into main memory to multiply components by
elements of the matrix. Thus, as an alternative, we can divide the matrix into
vertical stripes of equal width and divide the vector into an equal number of
horizontal stripes, of the same height. Our goal is to use enough stripes so that
the portion of the vector in one stripe can fit conveniently into main memory at
a compute node. Figure 2.4 suggests what the partition looks like if the matrix
and vector are each divided into five stripes.
The ith stripe of the matrix multiplies only components from the ith stripe
of the vector. Thus, we can divide the matrix into one file for each stripe, and
do the same for the vector. Each Map task is assigned a chunk from one of
the stripes of the matrix and gets the entire corresponding stripe of the vector.
The Map and Reduce tasks can then act exactly as was described above for the
case where Map tasks get the entire vector.
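The striping scheme can be sketched as follows. Again this is a single-machine sketch under our own assumptions (plain Python lists, equal-width stripes computed by integer arithmetic, and helper names of our choosing); in practice each stripe would be a separate DFS file:

```python
from collections import defaultdict

def stripe(triples, v, stripe_count):
    """Split M into vertical stripes by column j, and v into matching pieces.

    Stripe s holds the matrix elements with j in [s*w, (s+1)*w) and the
    corresponding components of v, so a Map task working on stripe s needs
    only that one piece of v in main memory.
    """
    n = len(v)
    w = (n + stripe_count - 1) // stripe_count  # stripe width
    m_stripes = [[] for _ in range(stripe_count)]
    for i, j, m_ij in triples:
        m_stripes[j // w].append((i, j, m_ij))
    v_stripes = [v[s * w:(s + 1) * w] for s in range(stripe_count)]
    return m_stripes, v_stripes

def product_by_stripes(triples, v, stripe_count):
    """Apply the same Map/Reduce logic as before, one stripe at a time."""
    m_stripes, v_stripes = stripe(triples, v, stripe_count)
    n = len(v)
    w = (n + stripe_count - 1) // stripe_count
    sums = defaultdict(float)
    for s, (ms, vs) in enumerate(zip(m_stripes, v_stripes)):
        offset = s * w
        for i, j, m_ij in ms:
            sums[i] += m_ij * vs[j - offset]  # v_j is found within stripe s
    return dict(sums)
```

The key property the code relies on is the one stated above: every element in stripe s of the matrix multiplies only components from stripe s of the vector, so the partial sums from different stripes can simply be added per key i.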
³ The matrix is sparse, with an average of 10 to 15 nonzero elements per row, since the
matrix represents the links in the Web, with m_ij nonzero if and only if there is a link from
page j to page i. Note that there is no way we could store a dense matrix whose side was
10^10, since it would have 10^20 elements.