element x_i is given by

    x_i = Σ_{j=1}^{n} m_ij v_j
If n = 100, we do not want to use a DFS or map-reduce for this calculation.
But this sort of calculation is at the heart of the ranking of Web pages that goes
on at search engines, and there, n is in the tens of billions. 3 Let us first assume
that n is large, but not so large that vector v cannot fit in main memory, and
be part of the input to every Map task. It is useful to observe at this time that
there is nothing in the definition of map-reduce that forbids providing the same
input to more than one Map task.
The matrix M and the vector v each will be stored in a file of the DFS. We
assume that the row-column coordinates of each matrix element will be discov-
erable, either from its position in the file, or because it is stored with explicit
coordinates, as a triple (i, j, m_ij). We also assume the position of element v_j
in the vector v will be discoverable in the analogous way.
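As a concrete illustration, the triple representation might look like the following sketch. The exact DFS file layout and field order are not specified by the text, so the Python data structures and the specific values here are our own assumptions:

```python
# Hypothetical sketch of the triple representation. A sparse 3x3 matrix M
# is stored as (i, j, m_ij) triples; zero entries are simply omitted,
# which is what makes this format attractive for sparse matrices.
M_triples = [
    (0, 0, 2.0), (0, 2, 1.0),
    (1, 1, 3.0),
    (2, 0, 4.0), (2, 2, 5.0),
]

# The vector v stored analogously, as (j, v_j) pairs.
v_pairs = [(0, 1.0), (1, 2.0), (2, 3.0)]
```

Note that omitting zero entries costs nothing: a missing triple (i, j, m_ij) contributes nothing to the sum for x_i.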
The Map Function: Each Map task will take the entire vector v and a chunk
of the matrix M. From each matrix element m_ij it produces the key-value pair
(i, m_ij v_j). Thus, all terms of the sum that make up the component x_i of the
matrix-vector product will get the same key.
The Reduce Function: A Reduce task has simply to sum all the values
associated with a given key i. The result will be a pair (i, x_i).
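The two functions can be sketched in Python as follows. This is a minimal single-machine sketch that assumes v fits in memory and ignores the DFS, chunking, and task scheduling; the function names are our own:

```python
from collections import defaultdict

def map_fn(matrix_chunk, v):
    """For each matrix element m_ij, emit the key-value pair (i, m_ij * v_j).

    matrix_chunk: a list of (i, j, m_ij) triples; v: the full vector as a list.
    """
    for i, j, m_ij in matrix_chunk:
        yield (i, m_ij * v[j])

def reduce_fn(key, values):
    """Sum all values associated with key i, producing the pair (i, x_i)."""
    return (key, sum(values))

def matrix_vector_product(triples, v):
    # Group the Map output by key, as the map-reduce framework would,
    # then apply the Reduce function to each group.
    groups = defaultdict(list)
    for key, value in map_fn(triples, v):
        groups[key].append(value)
    return dict(reduce_fn(k, vals) for k, vals in groups.items())
```

For example, the matrix [[1, 2], [3, 4]] as triples [(0,0,1), (0,1,2), (1,0,3), (1,1,4)] multiplied by v = [1, 1] yields x = (3, 7).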
2.3.2 If the Vector v Cannot Fit in Main Memory
However, it is possible that the vector v is so large that it will not fit in its
entirety in main memory. We don't have to fit it in main memory at a compute
node, but if we do not then there will be a very large number of disk accesses
as we move pieces of the vector into main memory to multiply components by
elements of the matrix. Thus, as an alternative, we can divide the matrix into
vertical stripes of equal width and divide the vector into an equal number of
horizontal stripes, of the same height. Our goal is to use enough stripes so that
the portion of the vector in one stripe can fit conveniently into main memory at
a compute node. Figure 2.4 suggests what the partition looks like if the matrix
and vector are each divided into five stripes.
The ith stripe of the matrix multiplies only components from the ith stripe
of the vector. Thus, we can divide the matrix into one file for each stripe, and
do the same for the vector. Each Map task is assigned a chunk from one of
the stripes of the matrix and gets the entire corresponding stripe of the vector.
The Map and Reduce tasks can then act exactly as was described above for the
case where Map tasks get the entire vector.
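The striping scheme can be sketched as follows. Again this is a single-machine sketch under our own assumptions (plain Python lists, equal-width stripes computed by integer arithmetic, and helper names of our choosing); in practice each stripe would be a separate DFS file:

```python
from collections import defaultdict

def stripe(triples, v, stripe_count):
    """Split M into vertical stripes by column j, and v into matching pieces.

    Stripe s holds the matrix elements with j in [s*w, (s+1)*w) and the
    corresponding components of v, so a Map task working on stripe s needs
    only that one piece of v in main memory.
    """
    n = len(v)
    w = (n + stripe_count - 1) // stripe_count  # stripe width
    m_stripes = [[] for _ in range(stripe_count)]
    for i, j, m_ij in triples:
        m_stripes[j // w].append((i, j, m_ij))
    v_stripes = [v[s * w:(s + 1) * w] for s in range(stripe_count)]
    return m_stripes, v_stripes

def product_by_stripes(triples, v, stripe_count):
    """Apply the same Map/Reduce logic as before, one stripe at a time."""
    m_stripes, v_stripes = stripe(triples, v, stripe_count)
    n = len(v)
    w = (n + stripe_count - 1) // stripe_count
    sums = defaultdict(float)
    for s, (ms, vs) in enumerate(zip(m_stripes, v_stripes)):
        offset = s * w
        for i, j, m_ij in ms:
            sums[i] += m_ij * vs[j - offset]  # v_j is found within stripe s
    return dict(sums)
```

The key property the code relies on is the one stated above: every element in stripe s of the matrix multiplies only components from stripe s of the vector, so the partial sums from different stripes can simply be added per key i.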
³ The matrix is sparse, with an average of 10 to 15 nonzero elements per row, since the
matrix represents the links in the Web, with m_ij nonzero if and only if there is a link from
page j to page i. Note that there is no way we could store a dense matrix whose side was
10^10, since it would have 10^20 elements.