Databases Reference
In-Depth Information
5.2.2 PageRank Iteration Using Map-Reduce
One iteration of the PageRank algorithm involves taking an estimated Page-
Rank vector v and computing the next estimate v by
v = βM v + (1−β)e/n
Recall β is a constant slightly less than 1, e is a vector of all 1's, and n is the
number of nodes in the graph that transition matrix M represents.
If n is small enough that each Map task can store the full vector v in main
memory and also have room in main memory for the result vector v , then there
is little more here than a matrix-vector multiplication. The additional steps
are to multiply each component of M v by constant β and to add (1−β)/n to
each component.
However, it is likely, given the size of the Web today, that v is much too
large to fit in main memory. As we discussed in Section 2.3.1, the method
of striping, where we break M into vertical stripes (see Fig. 2.4) and break v
into corresponding horizontal stripes, will allow us to execute the map-reduce
process e ciently, with no more of v at any one Map task than can conveniently
fit in main memory.
5.2.3 Use of Combiners to Consolidate the Result Vector
There are two reasons the method of Section 5.2.2 might not be adequate.
1. We might wish to add terms for v i , the ith component of the result vector
v, at the Map tasks. This improvement is the same as using a combiner,
since the Reduce function simply adds terms with a common key. Recall
that for a map-reduce implementation of matrix-vector multiplication,
the key is the value of i for which a term m ij v j is intended.
2. We might not be using map-reduce at all, but rather executing the itera-
tion step at a single machine or a collection of machines.
We shall assume that we are trying to implement a combiner in conjunction
with a Map task; the second case uses essentially the same idea.
Suppose that we are using the stripe method to partition a matrix and
vector that do not fit in main memory. Then a vertical stripe from the matrix
M and a horizontal stripe from the vector v will contribute to all components
of the result vector v . Since that vector is the same length as v, it will not
fit in main memory either. Moreover, as M is stored column-by-column for
e ciency reasons, a column can affect any of the components of v . As a result,
it is unlikely that when we need to add a term to some component v i , that
component will already be in main memory. Thus, most terms will require
that a page be brought into main memory to add it to the proper component.
That situation, called thrashing, takes orders of magnitude too much time to
be feasible.
Search WWH ::




Custom Search