5.5.3 Exercises for Section 5.5
EXERCISE 5.5.1 Compute the hubbiness and authority of each of the nodes in our original Web graph of Fig. 5.1.
! EXERCISE 5.5.2 Suppose our graph is a chain of n nodes, as was suggested by Fig. 5.9. Compute the hubs and authorities vectors, as a function of n.
5.6 Summary of Chapter 5
Term Spam: Early search engines were unable to deliver relevant results because they were vulnerable to term spam - the introduction into Web pages of words that misrepresented what the page was about.
The Google Solution to Term Spam: Google was able to counteract term spam by two techniques. First was the PageRank algorithm for determining the relative importance of pages on the Web. The second was a strategy of believing what other pages said about a given page, in or near their links to that page, rather than believing only what the page said about itself.
PageRank: PageRank is an algorithm that assigns a real number, called its PageRank, to each page on the Web. The PageRank of a page is a measure of how important the page is, or how likely it is to be a good response to a search query. In its simplest form, PageRank is a solution to the recursive equation “a page is important if important pages link to it.”
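In matrix terms, this recursive formulation says that the PageRank vector v satisfies the eigenvector equation

    v = Mv

where M is the transition matrix described in the next item; the taxation schemes discussed below modify this equation to cope with dead ends and spider traps.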
Transition Matrix of the Web: We represent links in the Web by a matrix whose ith row and ith column represent the ith page of the Web. If there are one or more links from page j to page i, then the entry in row i and column j is 1/k, where k is the number of pages to which page j links. Other entries of the transition matrix are 0.
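As an illustrative sketch only (the page names and the links dictionary below are invented for the example), such a transition matrix can be built in Python with NumPy from a mapping of each page to the pages it links to:

    import numpy as np

    # Hypothetical Web graph: each page maps to the list of pages it links to.
    links = {"A": ["B", "C"], "B": ["A"], "C": ["A", "B"]}
    pages = sorted(links)
    index = {p: i for i, p in enumerate(pages)}

    n = len(pages)
    M = np.zeros((n, n))
    for j, page in enumerate(pages):
        out = links[page]                         # the k pages that page j links to
        for target in out:
            M[index[target], j] = 1.0 / len(out)  # entry in row i, column j is 1/k

Every column of this example matrix sums to 1, because each page in the example has at least one out-link.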
Computing PageRank on Strongly Connected Web Graphs: For strongly connected Web graphs (those where any node can reach any other node), PageRank is the principal eigenvector of the transition matrix. We can compute PageRank by starting with any nonzero vector and repeatedly multiplying the current vector by the transition matrix, to get a better estimate. After about 50 iterations, the estimate will be very close to the limit, which is the true PageRank.
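A minimal sketch of that iteration, assuming the NumPy matrix M built in the previous sketch and a strongly connected graph with no dead ends:

    import numpy as np

    def pagerank_power_iteration(M, iterations=50):
        """Start from the uniform vector and repeatedly multiply by the transition matrix M."""
        n = M.shape[0]
        v = np.full(n, 1.0 / n)      # any nonzero starting vector works
        for _ in range(iterations):
            v = M @ v                # each multiplication gives a better estimate
        return v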
The Random Surfer Model: Calculation of PageRank can be thought of as simulating the behavior of many random surfers, who each start at a random page and at any step move, at random, to one of the pages to which their current page links. The limiting probability of a surfer being at a given page is the PageRank of that page. The intuition is that people tend to create links to the pages they think are useful, so random surfers will tend to be at a useful page.
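The random-surfer picture can also be simulated directly. The sketch below is only an illustration, not a procedure from the text: it walks many surfers over the same kind of links dictionary used above, and it assumes every page has at least one out-link:

    import random
    from collections import Counter

    def simulate_surfers(links, surfers=10_000, steps=50):
        """Estimate each page's PageRank as the fraction of surfers on it after many steps."""
        pages = list(links)
        positions = [random.choice(pages) for _ in range(surfers)]    # random starting pages
        for _ in range(steps):
            positions = [random.choice(links[p]) for p in positions]  # follow a random out-link
        counts = Counter(positions)
        return {page: counts[page] / surfers for page in pages}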
Dead Ends: A dead end is a Web page with no links out. The presence of dead ends will cause the PageRank of some or all of the pages to go to 0 in the iterative computation, including pages that are not dead ends. We can eliminate all dead ends before undertaking a PageRank calculation by recursively dropping nodes with no arcs out. Note that dropping one node can cause another, which linked only to it, to become a dead end, so the process must be recursive.
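One way to carry out that recursive removal, sketched in Python over a hypothetical links dictionary whose values are each page's out-links:

    def remove_dead_ends(links):
        """Repeatedly drop pages with no out-links, pruning links into the dropped pages."""
        graph = {page: set(out) for page, out in links.items()}
        while True:
            dead = [page for page, out in graph.items() if not out]
            if not dead:
                return graph
            for page in dead:
                del graph[page]
            for out in graph.values():
                out.difference_update(dead)   # these deletions may create new dead ends

PageRank can then be computed on the remaining graph, and the PageRank of the dropped pages estimated afterwards from the pages that link to them.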
Spider Traps: A spider trap is a set of nodes that, while they may link to each other, have no links out to other nodes. In an iterative calculation of PageRank, the presence of spider traps causes all the PageRank to be captured within that set of nodes.
Taxation Schemes: To counter the effect of spider traps (and of dead ends, if we do not eliminate them), PageRank is normally computed in a way that modifies the simple iterative multiplication by the transition matrix. A parameter β is chosen, typically around 0.85. Given an estimate of the PageRank, the next estimate is computed by multiplying the estimate by β times the transition matrix, and then adding (1 − β)/n to the estimate for each page, where n is the total number of pages.
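In symbols, if v is the current estimate of the PageRank vector, M the transition matrix, e a vector of all 1s, and n the number of pages, the new estimate is

    v′ = βMv + (1 − β)e/n

Equivalently, a random surfer follows a random out-link of the current page with probability β and teleports to a page chosen uniformly at random with probability 1 − β.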