Link Analysis - Mining of Massive Datasets

Databases Reference

In-Depth Information

5.2.2 PageRank Iteration Using Map-Reduce

One iteration of the PageRank algorithm involves taking an estimated Page-

Rank vector v and computing the next estimate v ′ by

v ′ = βM v + (1−β)e/n

Recall β is a constant slightly less than 1, e is a vector of all 1's, and n is the

number of nodes in the graph that transition matrix M represents.

If n is small enough that each Map task can store the full vector v in main

memory and also have room in main memory for the result vector v ′ , then there

is little more here than a matrix-vector multiplication. The additional steps

are to multiply each component of M v by constant β and to add (1−β)/n to

each component.

However, it is likely, given the size of the Web today, that v is much too

large to fit in main memory. As we discussed in Section 2.3.1, the method

of striping, where we break M into vertical stripes (see Fig. 2.4) and break v

into corresponding horizontal stripes, will allow us to execute the map-reduce

process e ciently, with no more of v at any one Map task than can conveniently

fit in main memory.

5.2.3 Use of Combiners to Consolidate the Result Vector

There are two reasons the method of Section 5.2.2 might not be adequate.

1. We might wish to add terms for v ′ i , the ith component of the result vector

v, at the Map tasks. This improvement is the same as using a combiner,

since the Reduce function simply adds terms with a common key. Recall

that for a map-reduce implementation of matrix-vector multiplication,

the key is the value of i for which a term m ij v j is intended.

2. We might not be using map-reduce at all, but rather executing the itera-

tion step at a single machine or a collection of machines.

We shall assume that we are trying to implement a combiner in conjunction

with a Map task; the second case uses essentially the same idea.

Suppose that we are using the stripe method to partition a matrix and

vector that do not fit in main memory. Then a vertical stripe from the matrix

M and a horizontal stripe from the vector v will contribute to all components

of the result vector v ′ . Since that vector is the same length as v, it will not

fit in main memory either. Moreover, as M is stored column-by-column for

e ciency reasons, a column can affect any of the components of v ′ . As a result,

it is unlikely that when we need to add a term to some component v ′ i , that

component will already be in main memory. Thus, most terms will require

that a page be brought into main memory to add it to the proper component.

That situation, called thrashing, takes orders of magnitude too much time to

be feasible.

Search WWH ::

Custom Search

Home