Database Reference
In-Depth Information
5.2.2
PageRank Iteration Using MapReduce
One iteration of the PageRank algorithm involves taking an estimated PageRank vector v
and computing the next estimate v ′ by
v ′ = βM v + (1 − β ) e / n
Recall β is a constant slightly less than 1, e is a vector of all 1s, and n is the number of
nodes in the graph that transition matrix M represents.
If n is small enough that each Map task can store the full vector v in main memory and
also have room in main memory for the result vector v ′, then there is little more here than
a matrix-vector multiplication. The additional steps are to multiply each component of M v
by constant β and to add (1 − β )/ n to each component.
However, it is likely, given the size of the Web today, that v is much too large to fit in
main memory. As we discussed in Section 2.3.1 , the method of striping, where we break
M into vertical stripes (see Fig. 2.4 ) and break v into corresponding horizontal stripes, will
allow us to execute the MapReduce process efficiently, with no more of v at any one Map
task than can conveniently fit in main memory.
5.2.3
Use of Combiners to Consolidate the Result Vector
There are two reasons the method of Section 5.2.2 might not be adequate.
(1) We might wish to add terms for the i th component of the result vector v , at the Map
tasks. This improvement is the same as using a combiner, since the Reduce function
simply adds terms with a common key. Recall that for a MapReduce implementation
of matrix-vector multiplication, the key is the value of i for which a term m ij v j is in-
tended.
(2) We might not be using MapReduce at all, but rather executing the iteration step at a
single machine or a collection of machines.
We shall assume that we are trying to implement a combiner in conjunction with a Map
task; the second case uses essentially the same idea.
Suppose that we are using the stripe method to partition a matrix and vector that do not
fit in main memory. Then a vertical stripe from the matrix M and a horizontal stripe from
the vector v will contribute to all components of the result vector v ′. Since that vector is the
same length as v , it will not fit in main memory either. Moreover, as M is stored column-
by-column for efficiency reasons, a column can affect any of the components of v ′. As a
result, it is unlikely that when we need to add a term to some component that component
Search WWH ::




Custom Search