These two techniques together make it very hard for the hypothetical shirt vendor to fool
Google. While the shirt-seller can still add “movie” to his page, the fact that Google
believes what other pages say about him over what he says about himself would negate the
use of false terms. The obvious countermeasure is for the shirt-seller to create many pages
of his own, and link to his shirt-selling page with a link that says “movie.” But those pages
would not be given much importance by PageRank, since other pages would not link to
them. The shirt-seller could create many links among his own pages, but none of these
pages would get much importance according to the PageRank algorithm, and therefore, he
still would not be able to fool Google into thinking his page was about movies.
Simplified PageRank Doesn't Work
As we shall see, computing PageRank by simulating random surfers is a time-consuming process. One might think
that simply counting the number of in-links for each page would be a good approximation to where random surfers
would wind up. However, if that is all we did, then the hypothetical shirt-seller could simply create a “spam farm” of
a million pages, each of which linked to his shirt page. Then, the shirt page looks very important indeed, and a search
engine would be fooled.
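
The weakness of raw in-link counting can be seen in a minimal sketch. The page names, the toy link structure, and the farm size below are hypothetical and chosen only for illustration; the function `inlink_counts` is not part of any real search engine.

```python
from collections import Counter


def inlink_counts(links):
    """Count in-links per page.

    `links` maps each page name to the list of pages it links to
    (a hypothetical toy representation of the Web graph).
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):      # a source counts once per target
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical toy web: three genuine pages link to "movie_site".
links = {
    "blog_a": ["movie_site"],
    "blog_b": ["movie_site"],
    "news_c": ["movie_site"],
    "movie_site": [],
    "shirt_page": [],
}

# The spam farm: pages that nobody links to, each pointing at the shirt page.
FARM_SIZE = 1000  # assumed size; the text imagines a million
for i in range(FARM_SIZE):
    links[f"spam_{i}"] = ["shirt_page"]

counts = inlink_counts(links)
print("movie_site in-links:", counts["movie_site"])
print("shirt_page in-links:", counts["shirt_page"])
print("spam_0 in-links:", counts["spam_0"])
```

Under this measure the shirt page outranks the genuine movie site, even though every page voting for it has no in-links of its own.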
It is reasonable to ask why simulation of random surfers should allow us to approximate
the intuitive notion of the “importance” of pages. There are two related motivations that
inspired this approach.
• Users of the Web “vote with their feet.” They tend to place links to pages they
think are good or useful pages to look at, rather than bad or useless pages.
• The behavior of a random surfer indicates which pages users of the Web are likely
to visit. Users are more likely to visit useful pages than useless pages.
But regardless of the reason, the PageRank measure has been proved empirically to work,
and so we shall study in detail how it is computed.
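
The random-surfer idea can be sketched as a small simulation. This is an illustration under stated assumptions, not the computation used in practice: the four-page web `WEB` is hypothetical, every page is assumed to have at least one out-link, and the surfer counts and step counts are arbitrary. Each surfer starts at a random page, repeatedly follows a randomly chosen out-link, and we record where the surfers end up.

```python
import random
from collections import Counter


def simulate_surfers(links, num_surfers=5000, num_steps=30, seed=0):
    """Estimate where random surfers wind up after `num_steps` clicks.

    `links` maps each page to a non-empty list of pages it links to.
    Returns the fraction of surfers that finish on each page.
    """
    rng = random.Random(seed)            # seeded for repeatable runs
    pages = sorted(links)
    final_pages = Counter()
    for _ in range(num_surfers):
        page = rng.choice(pages)         # start at a random page
        for _ in range(num_steps):
            page = rng.choice(links[page])   # follow a random out-link
        final_pages[page] += 1
    return {p: final_pages[p] / num_surfers for p in pages}


# Hypothetical four-page web; A receives links from B and C.
WEB = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}

fractions = simulate_surfers(WEB)
for page, fraction in sorted(fractions.items()):
    print(page, round(fraction, 3))
```

In this toy graph, roughly a third of the surfers finish on page A and the rest are spread about evenly over B, C, and D, so A would be judged the most important page. The sketch ignores pages with no out-links and other structural problems of the real Web.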
5.1.2 Definition of PageRank
PageRank is a function that assigns a real number to each page in the Web (or at least to
that portion of the Web that has been crawled and its links discovered). The intent is that
the higher the PageRank of a page, the more “important” it is. There is not one fixed
algorithm for assignment of PageRank, and in fact variations on the basic idea can alter the
relative PageRank of any two pages. We begin by defining the basic, idealized PageRank,
and follow it by modifications that are necessary for dealing with some real-world
problems concerning the structure of the Web.