Link Analysis - Mining of Massive Datasets

Database Reference

In-Depth Information

Figure 5.2 The “bowtie” picture of the Web

(1) The in-component , consisting of pages that could reach the SCC by following links,

but were not reachable from the SCC.

(2) The out-component , consisting of pages reachable from the SCC but unable to reach

the SCC.

(3) Tendrils , which are of two types. Some tendrils consist of pages reachable from the

in-component but not able to reach the in-component. The other tendrils can reach the

out-component, but are not reachable from the out-component.

In addition, there were small numbers of pages found either in

(a) Tubes , which are pages reachable from the in-component and able to reach the out-

component, but unable to reach the SCC or be reached from the SCC.

(b) Isolated components that are unreachable from the large components (the SCC, in- and

out-components) and unable to reach those components.

Several of these structures violate the assumptions needed for the Markov-process itera-

tion to converge to a limit. For example, when a random surfer enters the out-component,

they can never leave. As a result, surfers starting in either the SCC or in-component are go-

ing to wind up in either the out-component or a tendril off the in-component. Thus, no page

in the SCC or in-component winds up with any probability of a surfer being there. If we

interpret this probability as measuring the importance of a page, then we conclude falsely

that nothing in the SCC or in-component is of any importance.

As a result, PageRank is usually modified to prevent such anomalies. There are really

two problems we need to avoid. First is the dead end, a page that has no links out. Surfers

reaching such a page disappear, and the result is that in the limit no page that can reach a

dead end can have any PageRank at all. The second problem is groups of pages that all have

outlinks but they never link to any other pages. These structures are called spider traps . 3

Both these problems are solved by a method called “taxation,” where we assume a random

surfer has a finite probability of leaving the Web at any step, and new surfers are started at

each page. We shall illustrate this process as we study each of the two problem cases.

5.1.4

Avoiding Dead Ends

Recall that a page with no link out is called a dead end. If we allow dead ends, the transition

matrix of the Web is no longer stochastic, since some of the columns will sum to 0 rather

than 1. A matrix whose column sums are at most 1 is called substochastic . If we compute

Search WWH ::

Custom Search

Home