Database Reference
In-Depth Information
Figure 5.2 The “bowtie” picture of the Web
(1) The in-component , consisting of pages that could reach the SCC by following links,
but were not reachable from the SCC.
(2) The out-component , consisting of pages reachable from the SCC but unable to reach
the SCC.
(3) Tendrils , which are of two types. Some tendrils consist of pages reachable from the
in-component but not able to reach the in-component. The other tendrils can reach the
out-component, but are not reachable from the out-component.
In addition, there were small numbers of pages found either in
(a) Tubes , which are pages reachable from the in-component and able to reach the out-
component, but unable to reach the SCC or be reached from the SCC.
(b) Isolated components that are unreachable from the large components (the SCC, in- and
out-components) and unable to reach those components.
Several of these structures violate the assumptions needed for the Markov-process itera-
tion to converge to a limit. For example, when a random surfer enters the out-component,
they can never leave. As a result, surfers starting in either the SCC or in-component are go-
ing to wind up in either the out-component or a tendril off the in-component. Thus, no page
in the SCC or in-component winds up with any probability of a surfer being there. If we
interpret this probability as measuring the importance of a page, then we conclude falsely
that nothing in the SCC or in-component is of any importance.
As a result, PageRank is usually modified to prevent such anomalies. There are really
two problems we need to avoid. First is the dead end, a page that has no links out. Surfers
reaching such a page disappear, and the result is that in the limit no page that can reach a
dead end can have any PageRank at all. The second problem is groups of pages that all have
outlinks but they never link to any other pages. These structures are called spider traps . 3
Both these problems are solved by a method called “taxation,” where we assume a random
surfer has a finite probability of leaving the Web at any step, and new surfers are started at
each page. We shall illustrate this process as we study each of the two problem cases.
5.1.4
Avoiding Dead Ends
Recall that a page with no link out is called a dead end. If we allow dead ends, the transition
matrix of the Web is no longer stochastic, since some of the columns will sum to 0 rather
than 1. A matrix whose column sums are at most 1 is called substochastic . If we compute
Search WWH ::




Custom Search