Databases Reference
In-Depth Information
Chapter 5
Link Analysis
One of the biggest changes in our lives in the decade following the turn of
the century was the availability of e cient and accurate Web search, through
search engines such as Google. While Google was not the first search engine, it
was the first able to defeat the spammers who had made search almost useless.
Moreover, the innovation provided by Google was a nontrivial technological
advance, called “PageRank.” We shall begin the chapter by explaining what
PageRank is and how it is computed e ciently.
Yet the war between those who want to make the Web useful and those
who would exploit it for their own purposes is never over. When PageRank was
established as an essential technique for a search engine, spammers invented
ways to manipulate the PageRank of a Web page, often called link spam. 1
That development led to the response of TrustRank and other techniques for
preventing spammers from attacking PageRank. We shall discuss TrustRank
and other approaches to detecting link spam.
Finally, this chapter also covers some variations on PageRank. These tech-
niques include topic-sensitive PageRank (which can also be adapted for combat-
ing link spam) and the HITS, or “hubs and authorities” approach to evaluating
pages on the Web.
5.1
PageRank
We begin with a portion of the history of search engines, in order to motivate
the definition of PageRank, 2 a tool for evaluating the importance of Web pages
in a way that it is not easy to fool. We introduce the idea of “random surfers,”
to explain why PageRank is effective. We then introduce the technique of “tax-
ation” or recycling of random surfers, in order to avoid certain Web structures
1 Link spammers sometimes try to make their unethicality less apparent by referring to
what they do as “search-engine optimization.”
2 The term PageRank comes from Larry Page, the inventor of the idea and a founder of
Google.
145
Search WWH ::




Custom Search