5.4.3 Combating Link Spam
It has become essential for search engines to detect and eliminate link spam,
just as it was necessary in the previous decade to eliminate term spam. There
are two approaches to combating link spam. One is to look for structures such as the
spam farm in Fig. 5.16, where one page links to a very large number of pages,
each of which links back to it. Search engines surely search for such structures
and eliminate those pages from their index. That causes spammers to develop
different structures that have essentially the same effect of capturing PageRank
for a target page or pages. There is essentially no end to variations of Fig. 5.16,
so this war between the spammers and the search engines will likely go on for
a long time.
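To make the structural approach concrete, here is a minimal sketch of
detecting the simple pattern of Fig. 5.16: a target page that links to very
many supporting pages, each of which links back only to the target. The
graph representation, the threshold, and the function name are illustrative
assumptions; real search engines use proprietary and far more robust
detectors.

    def find_simple_spam_farms(out_links, min_supporters=1000):
        # out_links maps each page to the set of pages it links to
        # (an assumed representation of the Web graph).
        suspects = []
        for target, successors in out_links.items():
            if len(successors) < min_supporters:
                continue
            # In the Fig. 5.16 pattern, each supporting page links
            # back to the target and nowhere else.
            supporters = [p for p in successors
                          if out_links.get(p, set()) == {target}]
            if len(supporters) >= min_supporters:
                suspects.append(target)
        return suspects

Of course, the variant structures mentioned above defeat this exact test,
which is why structure detection alone is not sufficient.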
However, there is another approach to eliminating link spam that doesn't
rely on locating the spam farms. Rather, a search engine can modify its defini-
tion of PageRank to lower the rank of link-spam pages automatically. We shall
consider two different formulas:
1. TrustRank, a variation of topic-sensitive PageRank designed to lower the
score of spam pages.
2. Spam mass, a calculation that identifies the pages that are likely to be
spam and allows the search engine to eliminate those pages or to
strongly lower their PageRank (a brief preview sketch follows this list).
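Spam mass is developed in detail later in the chapter; as a preview, the
sketch below assumes the usual formulation, in which the spam mass of a
page p is the fraction of its PageRank not accounted for by its TrustRank,
that is, (r(p) - t(p))/r(p) for PageRank r and TrustRank t. The names and
the threshold are illustrative.

    def spam_mass(pagerank, trustrank):
        # Fraction of each page's PageRank not explained by TrustRank
        # (assumed definition; developed later in the chapter).
        return {p: (pagerank[p] - trustrank[p]) / pagerank[p]
                for p in pagerank}

    def likely_spam(pagerank, trustrank, threshold=0.9):
        # Pages with spam mass near 1 derive almost all their PageRank
        # from untrusted sources; the cutoff of 0.9 is an assumption.
        return {p for p, m in spam_mass(pagerank, trustrank).items()
                if m > threshold}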
5.4.4 TrustRank
TrustRank is topic-sensitive PageRank, where the “topic” is a set of pages be-
lieved to be trustworthy (not spam). The theory is that while a spam page
might easily be made to link to a trustworthy page, it is unlikely that a trust-
worthy page would link to a spam page. The borderline area is a site with
blogs or other opportunities for spammers to create links, as was discussed in
Section 5.4.1. These pages cannot be considered trustworthy, even if their own
content is highly reliable, as would be the case for a reputable newspaper that
allowed readers to post comments.
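Since TrustRank is just topic-sensitive PageRank with the trustworthy pages
as the teleport set, it can be computed by the same power iteration used
for ordinary PageRank, with teleportation confined to the trusted set. The
sketch below assumes a column-stochastic transition matrix M, β = 0.85,
and a fixed number of iterations; all of these choices are illustrative.

    import numpy as np

    def trustrank(M, trusted, beta=0.85, iterations=50):
        # M is the column-stochastic transition matrix of the Web graph:
        # M[i, j] = 1/deg(j) if page j links to page i, else 0.
        n = M.shape[1]
        # Teleport vector: uniform over the trusted set, zero elsewhere.
        e_s = np.zeros(n)
        e_s[list(trusted)] = 1.0 / len(trusted)
        v = e_s.copy()
        for _ in range(iterations):
            # v' = beta * M v + (1 - beta) * e_S / |S|
            v = beta * (M @ v) + (1 - beta) * e_s
        return v

For example, on a small graph in which only page 0 is trusted,
trustrank(M, [0]) concentrates rank on page 0 and on pages it reaches;
spam pages that no trusted page links to, directly or indirectly, receive
little rank.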
To implement TrustRank, we need to develop a suitable teleport set of
trustworthy pages. Two approaches that have been tried are:
1. Let humans examine a set of pages and decide which of them are trust-
worthy. For example, we might pick the pages of highest PageRank to
examine, on the theory that, while link spam can raise a page's rank from
the bottom to the middle of the pack, it is essentially impossible to give
a spam page a PageRank near the top of the list.
2. Pick a domain whose membership is controlled, on the assumption that it
is hard for a spammer to get their pages into these domains. For example,
we could pick the .edu domain, since university pages are unlikely to be
spam farms. We could likewise pick .mil or .gov. However, the problem
with these specific choices is that they are almost exclusively US sites.
To get a good distribution of trustworthy Web pages, we should also
include the analogous sites from other countries, e.g., ac.il or edu.sg.
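As an illustration of the second approach, a crawler's URL list could be
filtered down to controlled domains to form the teleport set. The suffix
list and function name below are assumptions made for the sketch, not a
recommendation beyond the domains named above.

    from urllib.parse import urlparse

    # Controlled domains, plus foreign analogs such as ac.il or edu.sg.
    TRUSTED_SUFFIXES = ('.edu', '.gov', '.mil', '.ac.il', '.edu.sg')

    def trusted_seed(urls):
        # Keep only pages whose host lies in a controlled domain.
        seed = set()
        for url in urls:
            host = urlparse(url).netloc.lower()
            if host.endswith(TRUSTED_SUFFIXES):
                seed.add(url)
        return seed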