5.4.3 Combating Link Spam
It has become essential for search engines to detect and eliminate link spam,
just as it was necessary in the previous decade to eliminate term spam. There
are two approaches to combating link spam. One is to look for structures such as the
spam farm in Fig. 5.16, where one page links to a very large number of pages,
each of which links back to it. Search engines surely search for such structures
and eliminate those pages from their index. That causes spammers to develop
different structures that have essentially the same effect of capturing PageRank
for a target page or pages. There is essentially no end to variations of Fig. 5.16,
so this war between the spammers and the search engines will likely go on for
a long time.
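To make the structural approach concrete, here is a minimal sketch of
detecting the simple pattern of Fig. 5.16: a target page that links to very
many supporting pages, each of which links back only to the target. The
graph representation, the threshold, and the function name are illustrative
assumptions; real search engines use proprietary and far more robust
detectors.

    def find_simple_spam_farms(out_links, min_supporters=1000):
        # out_links maps each page to the set of pages it links to
        # (an assumed representation of the Web graph).
        suspects = []
        for target, successors in out_links.items():
            if len(successors) < min_supporters:
                continue
            # In the Fig. 5.16 pattern, each supporting page links
            # back to the target and nowhere else.
            supporters = [p for p in successors
                          if out_links.get(p, set()) == {target}]
            if len(supporters) >= min_supporters:
                suspects.append(target)
        return suspects

Of course, the variant structures mentioned above defeat this exact test,
which is why structure detection alone is not sufficient.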
However, there is another approach to eliminating link spam that doesn't
rely on locating the spam farms. Rather, a search engine can modify its defini-
tion of PageRank to lower the rank of link-spam pages automatically. We shall
consider two different formulas:
1. TrustRank, a variation of topic-sensitive PageRank designed to lower the
score of spam pages.
2. Spam mass, a calculation that identifies the pages that are likely to be
spam and allows the search engine to eliminate those pages or to
strongly lower their PageRank (a brief preview sketch follows this list).
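Spam mass is developed in detail later in the chapter; as a preview, the
sketch below assumes the usual formulation, in which the spam mass of a
page p is the fraction of its PageRank not accounted for by its TrustRank,
that is, (r(p) - t(p))/r(p) for PageRank r and TrustRank t. The names and
the threshold are illustrative.

    def spam_mass(pagerank, trustrank):
        # Fraction of each page's PageRank not explained by TrustRank
        # (assumed definition; developed later in the chapter).
        return {p: (pagerank[p] - trustrank[p]) / pagerank[p]
                for p in pagerank}

    def likely_spam(pagerank, trustrank, threshold=0.9):
        # Pages with spam mass near 1 derive almost all their PageRank
        # from untrusted sources; the cutoff of 0.9 is an assumption.
        return {p for p, m in spam_mass(pagerank, trustrank).items()
                if m > threshold}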
5.4.4 TrustRank
TrustRank is topic-sensitive PageRank, where the “topic” is a set of pages be-
lieved to be trustworthy (not spam). The theory is that while a spam page
might easily be made to link to a trustworthy page, it is unlikely that a trust-
worthy page would link to a spam page. The borderline area is a site with
blogs or other opportunities for spammers to create links, as was discussed in
Section 5.4.1. These pages cannot be considered trustworthy, even if their own
content is highly reliable, as would be the case for a reputable newspaper that
allowed readers to post comments.
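Since TrustRank is just topic-sensitive PageRank with the trustworthy pages
as the teleport set, it can be computed by the same power iteration used
for ordinary PageRank, with teleportation confined to the trusted set. The
sketch below assumes a column-stochastic transition matrix M, β = 0.85,
and a fixed number of iterations; all of these choices are illustrative.

    import numpy as np

    def trustrank(M, trusted, beta=0.85, iterations=50):
        # M is the column-stochastic transition matrix of the Web graph:
        # M[i, j] = 1/deg(j) if page j links to page i, else 0.
        n = M.shape[1]
        # Teleport vector: uniform over the trusted set, zero elsewhere.
        e_s = np.zeros(n)
        e_s[list(trusted)] = 1.0 / len(trusted)
        v = e_s.copy()
        for _ in range(iterations):
            # v' = beta * M v + (1 - beta) * e_S / |S|
            v = beta * (M @ v) + (1 - beta) * e_s
        return v

For example, on a small graph in which only page 0 is trusted,
trustrank(M, [0]) concentrates rank on page 0 and on pages it reaches;
spam pages that no trusted page links to, directly or indirectly, receive
little rank.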
To implement TrustRank, we need to develop a suitable teleport set of
trustworthy pages. Two approaches that have been tried are:
1. Let humans examine a set of pages and decide which of them are trust-
worthy. For example, we might pick the pages of highest PageRank to
examine, on the theory that, while link spam can raise a page's rank from
the bottom to the middle of the pack, it is essentially impossible to give
a spam page a PageRank near the top of the list.
2. Pick a domain whose membership is controlled, on the assumption that it
is hard for a spammer to get their pages into these domains. For example,
we could pick the .edu domain, since university pages are unlikely to be
spam farms. We could likewise pick .mil or .gov. However, the problem
with these specific choices is that they are almost exclusively US sites.
To get a good distribution of trustworthy Web pages, we should also
include the analogous sites from other countries, e.g., ac.il or edu.sg.
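As an illustration of the second approach, a crawler's URL list could be
filtered down to controlled domains to form the teleport set. The suffix
list and function name below are assumptions made for the sketch, not a
recommendation beyond the domains named above.

    from urllib.parse import urlparse

    # Controlled domains, plus foreign analogs such as ac.il or edu.sg.
    TRUSTED_SUFFIXES = ('.edu', '.gov', '.mil', '.ac.il', '.edu.sg')

    def trusted_seed(urls):
        # Keep only pages whose host lies in a controlled domain.
        seed = set()
        for url in urls:
            host = urlparse(url).netloc.lower()
            if host.endswith(TRUSTED_SUFFIXES):
                seed.add(url)
        return seed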