5 Link Analysis
One of the biggest changes in our lives in the decade following the turn of the century was
the availability of efficient and accurate Web search, through search engines such as Google.
While Google was not the first search engine, it was the first able to defeat the spammers
who had made search almost useless. Moreover, the innovation provided by Google was a
nontrivial technological advance, called “PageRank.” We shall begin the chapter by explaining
what PageRank is and how it is computed efficiently.
Yet the war between those who want to make the Web useful and those who would exploit
it for their own purposes is never over. When PageRank was established as an essential technique
for a search engine, spammers invented ways to manipulate the PageRank of a Web
page, a practice often called link spam. That development led to the response of TrustRank and other
techniques for preventing spammers from attacking PageRank. We shall discuss TrustRank
and other approaches to detecting link spam.
Finally, this chapter also covers some variations on PageRank. These techniques include
topic-sensitive PageRank (which can also be adapted for combating link spam) and the
HITS, or “hubs and authorities” approach to evaluating pages on the Web.
5.1 PageRank
We begin with a portion of the history of search engines, in order to motivate the definition
of PageRank, a tool for evaluating the importance of Web pages in a way that is not easy
to fool. We introduce the idea of “random surfers” to explain why PageRank is effective.
We then introduce the technique of “taxation” or recycling of random surfers, in order to
avoid certain Web structures that present problems for the simple version of PageRank.