will be $M(Mv_0) = M^2v_0$, and so on. In general, multiplying the initial vector $v_0$ by $M$ a total of $i$ times will give us the distribution of the surfer after $i$ steps.
To see why multiplying a distribution vector $v$ by $M$ gives the distribution $x = Mv$ at the next step, we reason as follows. The probability $x_i$ that a random surfer will be at node $i$ at the next step is $\sum_j m_{ij} v_j$. Here, $m_{ij}$ is the probability that a surfer at node $j$ will move to node $i$ at the next step (often 0 because there is no link from $j$ to $i$), and $v_j$ is the probability that the surfer was at node $j$ at the previous step.
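As a concrete illustration, here is a minimal Python sketch of one such step; the 2-node matrix is a made-up example, not one from the text:

```python
# One step of the surfer distribution: x = Mv, with x_i = sum_j m_ij * v_j.
def step(M, v):
    n = len(v)
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

# A hypothetical 2-node web; each column sums to 1, so M is stochastic.
M = [[0.0, 0.5],   # m_01 = 1/2: a surfer at node 1 moves to node 0 half the time
     [1.0, 0.5]]
v = [0.5, 0.5]     # uniform initial distribution
print(step(M, v))  # [0.25, 0.75]
```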
This sort of behavior is an example of the ancient theory of Markov processes. It is known that the distribution of the surfer approaches a limiting distribution $v$ that satisfies $v = Mv$, provided two conditions are met:
(1) The graph is strongly connected; that is, it is possible to get from any node to any other node.
(2) There are no dead ends: nodes that have no arcs out.
Note that Fig. 5.1 satisfies both these conditions.
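Since Fig. 5.1 is not reproduced here, the following sketch shows how the two conditions could be checked on any directed graph given as adjacency lists; the three-node graph at the bottom is hypothetical:

```python
def reachable(adj, start):
    """Set of nodes reachable from start by depth-first search."""
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def strongly_connected(adj):
    # Strongly connected iff one node reaches every node and every node
    # reaches it; we test the first node against the graph and its reversal.
    nodes = list(adj)
    rev = {u: [] for u in nodes}
    for u in nodes:
        for w in adj[u]:
            rev[w].append(u)
    s = nodes[0]
    return reachable(adj, s) == set(nodes) == reachable(rev, s)

def no_dead_ends(adj):
    return all(adj[u] for u in adj)   # every node has at least one arc out

adj = {'A': ['B', 'C'], 'B': ['A'], 'C': ['A']}    # hypothetical graph
print(strongly_connected(adj), no_dead_ends(adj))  # True True
```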
The limit is reached when multiplying the distribution by $M$ another time does not change the distribution. In other words, the limiting $v$ is an eigenvector of $M$ (an eigenvector of a matrix $M$ is a vector $v$ that satisfies $Mv = \lambda v$ for some constant eigenvalue $\lambda$). In fact, because $M$ is stochastic, meaning that its columns each add up to 1, $v$ is the principal eigenvector (the eigenvector whose associated eigenvalue is the largest of all eigenvalues). Note also that, because $M$ is stochastic, the eigenvalue associated with the principal eigenvector is 1.
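These claims can be checked numerically; below is a small sketch using NumPy on a made-up 3-node stochastic matrix (not the matrix of Example 5.1):

```python
import numpy as np

# A hypothetical column-stochastic matrix: each column sums to 1.
M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])

vals, vecs = np.linalg.eig(M)
k = np.argmax(vals.real)        # index of the principal eigenvalue
print(round(vals[k].real, 6))   # 1.0, as claimed for stochastic matrices
v = vecs[:, k].real
v /= v.sum()                    # rescale to a probability distribution
print(np.round(v, 4))           # fixed point: M @ v equals v
```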
The principal eigenvector of $M$ tells us where the surfer is most likely to be after a long time. Recall that the intuition behind PageRank is that the more likely a surfer is to be at a page, the more important the page is. We can compute the principal eigenvector of $M$ by starting with the initial vector $v_0$ and multiplying by $M$ some number of times, until the vector we get shows little change at each round. In practice, for the Web itself, 50-75 iterations are sufficient to converge to within the error limits of double-precision arithmetic.
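The procedure just described is ordinary power iteration. A minimal sketch, reusing the hypothetical matrix from the previous example and an illustrative tolerance of 1e-12:

```python
import numpy as np

def power_iteration(M, tol=1e-12, max_iters=100):
    """Multiply the uniform vector v0 by M until the change is below tol."""
    n = M.shape[0]
    v = np.full(n, 1.0 / n)              # initial vector v0, each entry 1/n
    for i in range(1, max_iters + 1):
        nxt = M @ v
        if np.abs(nxt - v).max() < tol:  # little change: we are at the limit
            return nxt, i
        v = nxt
    return v, max_iters

M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
v, iters = power_iteration(M)
print(np.round(v, 4), iters)             # the principal eigenvector, v = Mv
```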
EXAMPLE 5.2 Suppose we apply the process described above to the matrix $M$ from Example 5.1. Since there are four nodes, the initial vector $v_0$ has four components, each 1/4. The sequence of approximations to the limit that we get by multiplying at each step by $M$ is:

$$\begin{pmatrix}1/4\\ 1/4\\ 1/4\\ 1/4\end{pmatrix},\quad \begin{pmatrix}9/24\\ 5/24\\ 5/24\\ 5/24\end{pmatrix},\quad \begin{pmatrix}15/48\\ 11/48\\ 11/48\\ 11/48\end{pmatrix},\quad \begin{pmatrix}11/32\\ 7/32\\ 7/32\\ 7/32\end{pmatrix},\quad \ldots,\quad \begin{pmatrix}3/9\\ 2/9\\ 2/9\\ 2/9\end{pmatrix}$$
Notice that in this example, the probabilities for $B$, $C$, and $D$ remain the same. It is easy to see that $B$ and $C$ must always have the same values at any iteration, because their rows in $M$ are identical. To show that their values are also the same as the value for $D$, an inductive proof works, and we leave it as an exercise.
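For reference, the iteration of Example 5.2 can be reproduced exactly with rational arithmetic. The matrix below assumes the link structure of Example 5.1 as usually given ($A$ links to $B$, $C$, and $D$; $B$ to $A$ and $D$; $C$ to $A$; $D$ to $B$ and $C$); since that figure is not reproduced in this excerpt, treat the matrix as an assumption:

```python
from fractions import Fraction as F

# Assumed transition matrix of Example 5.1; rows/columns ordered A, B, C, D.
# Note that the rows for B and C are identical, as the text observes.
M = [[F(0),    F(1, 2), F(1), F(0)],
     [F(1, 3), F(0),    F(0), F(1, 2)],
     [F(1, 3), F(0),    F(0), F(1, 2)],
     [F(1, 3), F(1, 2), F(0), F(0)]]

v = [F(1, 4)] * 4                     # v0 = (1/4, 1/4, 1/4, 1/4)
for _ in range(3):
    v = [sum(M[i][j] * v[j] for j in range(4)) for i in range(4)]
    print([str(x) for x in v])
# ['3/8', '5/24', '5/24', '5/24']      (i.e. 9/24, 5/24, 5/24, 5/24)
# ['5/16', '11/48', '11/48', '11/48']  (i.e. 15/48, 11/48, 11/48, 11/48)
# ['11/32', '7/32', '7/32', '7/32']    -> the limit is (3/9, 2/9, 2/9, 2/9)
```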