5.5.3 Exercises for Section 5.5
EXERCISE 5.5.1 Compute the hubbiness and authority of each of the nodes in our original Web graph of Fig. 5.1.
! EXERCISE 5.5.2 Suppose our graph is a chain of n nodes, as was suggested by Fig. 5.9. Compute the hubs and authorities vectors, as a function of n.
5.6 Summary of Chapter 5
Term Spam: Early search engines were unable to deliver relevant results because they were vulnerable to term spam - the introduction into Web pages of words that misrepresented what the page was about.
The Google Solution to Term Spam: Google was able to counteract term spam by two techniques. First was the PageRank algorithm for determining the relative importance of pages on the Web. The second was a strategy of believing what other pages said about a given page, in or near their links to that page, rather than believing only what the page said about itself.
PageRank: PageRank is an algorithm that assigns a real number, called its PageRank, to each page on the Web. The PageRank of a page is a measure of how important the page is, or how likely it is to be a good response to a search query. In its simplest form, PageRank is a solution to the recursive equation “a page is important if important pages link to it.”
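In matrix terms, this recursive formulation says that the PageRank vector v satisfies the eigenvector equation

    v = Mv

where M is the transition matrix described in the next item; the taxation schemes discussed below modify this equation to cope with dead ends and spider traps.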
Transition Matrix of the Web: We represent links in the Web by a matrix whose ith row and ith column represent the ith page of the Web. If there are one or more links from page j to page i, then the entry in row i and column j is 1/k, where k is the number of pages to which page j links. Other entries of the transition matrix are 0.
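As an illustrative sketch only (the page names and the links dictionary below are invented for the example), such a transition matrix can be built in Python with NumPy from a mapping of each page to the pages it links to:

    import numpy as np

    # Hypothetical Web graph: each page maps to the list of pages it links to.
    links = {"A": ["B", "C"], "B": ["A"], "C": ["A", "B"]}
    pages = sorted(links)
    index = {p: i for i, p in enumerate(pages)}

    n = len(pages)
    M = np.zeros((n, n))
    for j, page in enumerate(pages):
        out = links[page]                         # the k pages that page j links to
        for target in out:
            M[index[target], j] = 1.0 / len(out)  # entry in row i, column j is 1/k

Every column of this example matrix sums to 1, because each page in the example has at least one out-link.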
Computing PageRank on Strongly Connected Web Graphs: For strongly connected Web graphs (those where any node can reach any other node), PageRank is the principal eigenvector of the transition matrix. We can compute PageRank by starting with any nonzero vector and repeatedly multiplying the current vector by the transition matrix, to get a better estimate. After about 50 iterations, the estimate will be very close to the limit, which is the true PageRank.
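A minimal sketch of that iteration, assuming the NumPy matrix M built in the previous sketch and a strongly connected graph with no dead ends:

    import numpy as np

    def pagerank_power_iteration(M, iterations=50):
        """Start from the uniform vector and repeatedly multiply by the transition matrix M."""
        n = M.shape[0]
        v = np.full(n, 1.0 / n)      # any nonzero starting vector works
        for _ in range(iterations):
            v = M @ v                # each multiplication gives a better estimate
        return v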
The Random Surfer Model: Calculation of PageRank can be thought of as simulating the behavior of many random surfers, who each start at a random page and at any step move, at random, to one of the pages to which their current page links. The limiting probability of a surfer being at a given page is the PageRank of that page. The intuition is that people tend to create links to the pages they think are useful, so random surfers will tend to be at a useful page.
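The random-surfer picture can also be simulated directly. The sketch below is only an illustration, not a procedure from the text: it walks many surfers over the same kind of links dictionary used above, and it assumes every page has at least one out-link:

    import random
    from collections import Counter

    def simulate_surfers(links, surfers=10_000, steps=50):
        """Estimate each page's PageRank as the fraction of surfers on it after many steps."""
        pages = list(links)
        positions = [random.choice(pages) for _ in range(surfers)]    # random starting pages
        for _ in range(steps):
            positions = [random.choice(links[p]) for p in positions]  # follow a random out-link
        counts = Counter(positions)
        return {page: counts[page] / surfers for page in pages}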
Dead Ends: A dead end is a Web page with no links out. The presence of dead ends will cause the PageRank of some or all of the pages to go to 0 in the iterative computation, including pages that are not dead ends. We can eliminate all dead ends before undertaking a PageRank calculation by recursively dropping nodes with no arcs out. Note that dropping one node can cause another, which linked only to it, to become a dead end, so the process must be recursive.
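One way to carry out that recursive removal, sketched in Python over a hypothetical links dictionary whose values are each page's out-links:

    def remove_dead_ends(links):
        """Repeatedly drop pages with no out-links, pruning links into the dropped pages."""
        graph = {page: set(out) for page, out in links.items()}
        while True:
            dead = [page for page, out in graph.items() if not out]
            if not dead:
                return graph
            for page in dead:
                del graph[page]
            for out in graph.values():
                out.difference_update(dead)   # these deletions may create new dead ends

PageRank can then be computed on the remaining graph, and the PageRank of the dropped pages estimated afterwards from the pages that link to them.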
Spider Traps: A spider trap is a set of nodes that, while they may link to each other, have no links out to other nodes. In an iterative calculation of PageRank, the presence of spider traps causes all the PageRank to be captured within that set of nodes.
Taxation Schemes: To counter the effect of spider traps (and of dead ends, if we do not eliminate them), PageRank is normally computed in a way that modifies the simple iterative multiplication by the transition matrix. A parameter β is chosen, typically around 0.85. Given an estimate of the PageRank, the next estimate is computed by multiplying the estimate by β times the transition matrix, and then adding (1 − β)/n to the estimate for each page, where n is the total number of pages.
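In symbols, if v is the current estimate of the PageRank vector, M the transition matrix, e a vector of all 1s, and n the number of pages, the new estimate is

    v′ = βMv + (1 − β)e/n

Equivalently, a random surfer follows a random out-link of the current page with probability β and teleports to a page chosen uniformly at random with probability 1 − β.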