Link Analysis - Mining of Massive Datasets

Database Reference

In-Depth Information

5.2.2

PageRank Iteration Using MapReduce

One iteration of the PageRank algorithm involves taking an estimated PageRank vector v

and computing the next estimate v ′ by

v ′ = βM v + (1 − β ) e / n

Recall β is a constant slightly less than 1, e is a vector of all 1s, and n is the number of

nodes in the graph that transition matrix M represents.

If n is small enough that each Map task can store the full vector v in main memory and

also have room in main memory for the result vector v ′, then there is little more here than

a matrix-vector multiplication. The additional steps are to multiply each component of M v

by constant β and to add (1 − β )/ n to each component.

However, it is likely, given the size of the Web today, that v is much too large to fit in

main memory. As we discussed in Section 2.3.1 , the method of striping, where we break

M into vertical stripes (see Fig. 2.4 ) and break v into corresponding horizontal stripes, will

allow us to execute the MapReduce process efficiently, with no more of v at any one Map

task than can conveniently fit in main memory.

5.2.3

Use of Combiners to Consolidate the Result Vector

There are two reasons the method of Section 5.2.2 might not be adequate.

(1) We might wish to add terms for the i th component of the result vector v , at the Map

tasks. This improvement is the same as using a combiner, since the Reduce function

simply adds terms with a common key. Recall that for a MapReduce implementation

of matrix-vector multiplication, the key is the value of i for which a term m ij v j is in-

tended.

(2) We might not be using MapReduce at all, but rather executing the iteration step at a

single machine or a collection of machines.

We shall assume that we are trying to implement a combiner in conjunction with a Map

task; the second case uses essentially the same idea.

Suppose that we are using the stripe method to partition a matrix and vector that do not

fit in main memory. Then a vertical stripe from the matrix M and a horizontal stripe from

the vector v will contribute to all components of the result vector v ′. Since that vector is the

same length as v , it will not fit in main memory either. Moreover, as M is stored column-

by-column for efficiency reasons, a column can affect any of the components of v ′. As a

result, it is unlikely that when we need to add a term to some component that component

Search WWH ::

Custom Search

Home