5.4.4 TrustRank
TrustRank is topic-sensitive PageRank, where the “topic” is a set of pages believed to be
trustworthy (not spam). The theory is that while a spam page might easily be made to link
to a trustworthy page, it is unlikely that a trustworthy page would link to a spam page. The
borderline area is a site with blogs or other opportunities for spammers to create links, as
was discussed in Section 5.4.1. These pages cannot be considered trustworthy, even if their
own content is highly reliable, as would be the case for a reputable newspaper that allowed
readers to post comments.
To implement TrustRank, we need to develop a suitable teleport set of trustworthy
pages. Two approaches that have been tried are:
(1) Let humans examine a set of pages and decide which of them are trustworthy. For
example, we might pick the pages of highest PageRank to examine, on the theory that,
while link spam can raise a page's rank from the bottom to the middle of the pack, it is
essentially impossible to give a spam page a PageRank near the top of the list.
(2) Pick a domain whose membership is controlled, on the assumption that it is hard for
a spammer to get their pages into these domains. For example, we could pick the .edu
domain, since university pages are unlikely to be spam farms. We could likewise pick
.mil, or .gov. However, the problem with these specific choices is that they are almost
exclusively US sites. To get a good distribution of trustworthy Web pages, we should
include the analogous sites from foreign countries, e.g., ac.il, or edu.sg.
It is likely that search engines today implement a strategy of the second type routinely, so
that what we think of as PageRank really is a form of TrustRank.
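The computation described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a small in-memory graph with no dead ends; the function name `trust_rank`, the taxation parameter `beta=0.8`, the fixed iteration count, and the four-page graph are all choices made for this sketch, not anything prescribed by the text.

```python
def trust_rank(links, trusted, beta=0.8, iterations=50):
    """Topic-sensitive PageRank whose teleport set is `trusted`.

    links   -- dict mapping every page to the list of pages it links to
               (assumption: every page has at least one out-link)
    trusted -- set of pages believed to be trustworthy (the teleport set)
    beta    -- probability of following a link rather than teleporting
    """
    n_trusted = len(trusted)
    # Start the random surfer at a uniformly chosen trusted page.
    rank = {p: (1.0 / n_trusted if p in trusted else 0.0) for p in links}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in links}
        # Each page passes a beta fraction of its rank along its out-links.
        for page, outs in links.items():
            share = beta * rank[page] / len(outs)
            for target in outs:
                new_rank[target] += share
        # The remaining (1 - beta) is teleported to trusted pages only.
        for page in trusted:
            new_rank[page] += (1.0 - beta) / n_trusted
        rank = new_rank
    return rank


# Hypothetical graph: A, B, C link among themselves; the spam page S
# links to the trusted page A, but no page links to S.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "S": ["A"]}
ranks = trust_rank(links, trusted={"A"})
```

In this toy graph the page S ends with a TrustRank of 0, because trust flows only along links that start, directly or indirectly, at the teleport set. Calling the same function with `trusted=set(links)` gives ordinary PageRank, since the teleport set is then every page.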
5.4.5 Spam Mass
The idea behind spam mass is that we measure for each page the fraction of its PageRank
that comes from spam. We do so by computing both the ordinary PageRank and the
TrustRank based on some teleport set of trustworthy pages. Suppose page p has PageRank
r and TrustRank t. Then the spam mass of p is (r − t)/r. A negative or small positive spam
mass means that p is probably not a spam page, while a spam mass close to 1 suggests that
the page probably is spam. It is possible to eliminate pages with a high spam mass from the
index of Web pages used by a search engine, thus eliminating a great deal of the link spam
without having to identify particular structures that spam farmers use.
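As a rough sketch of this filtering step, the following Python computes the spam mass (r − t)/r for each page and lists the pages above a cutoff. The function names, the cutoff of 0.8, and the two-page score tables are hypothetical values chosen for illustration; the text does not specify a threshold.

```python
def spam_mass(pagerank, trustrank):
    """Return (r - t) / r for every page with positive PageRank r.

    pagerank  -- dict mapping page -> ordinary PageRank r
    trustrank -- dict mapping page -> TrustRank t (missing pages count as 0)
    """
    return {
        page: (r - trustrank.get(page, 0.0)) / r
        for page, r in pagerank.items()
        if r > 0
    }


def likely_spam(pagerank, trustrank, threshold=0.8):
    """List pages whose spam mass exceeds `threshold` (a hypothetical cutoff)."""
    mass = spam_mass(pagerank, trustrank)
    return sorted(page for page, m in mass.items() if m > threshold)


# Made-up scores, not taken from any example in the text.
pagerank = {"a": 0.4, "b": 0.1}
trustrank = {"a": 0.5, "b": 0.01}

mass = spam_mass(pagerank, trustrank)
flagged = likely_spam(pagerank, trustrank)
```

Here page `a` has a negative spam mass because its TrustRank exceeds its PageRank, while page `b` draws almost all of its PageRank from outside the trusted set and is the only page flagged.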
EXAMPLE 5.12 Let us consider both the PageRank and topic-sensitive PageRank that were
computed for the graph of Fig. 5.1 in Examples 5.2 and 5.10, respectively. In the latter case,