with these specific choices is that they are almost exclusively US sites. To
get a good distribution of trustworthy Web pages, we should include the
analogous sites from foreign countries, e.g., ac.il, or edu.sg.
It is likely that search engines today implement a strategy of the second type
routinely, so that what we think of as PageRank really is a form of TrustRank.
5.4.5 Spam Mass
The idea behind spam mass is that we measure, for each page, the fraction of
its PageRank that comes from spam. We do so by computing both the ordinary PageRank
and the TrustRank based on some teleport set of trustworthy pages. Suppose
page p has PageRank r and TrustRank t. Then the spam mass of p is (r−t)/r.
A negative or small positive spam mass means that p is probably not a spam
page, while a spam mass close to 1 suggests that the page probably is spam.
It is possible to eliminate pages with a high spam mass from the index of Web
pages used by a search engine, thus eliminating a great deal of the link spam
without having to identify particular structures that spam farmers use.
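The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the method any particular search engine uses: the function names, the input layout, and the 0.9 cutoff are all assumptions made for the example, since the text only says that pages with a "high" spam mass may be eliminated.

```python
def spam_mass(pagerank: float, trustrank: float) -> float:
    """Return (r - t) / r for a page with PageRank r and TrustRank t."""
    if pagerank <= 0:
        raise ValueError("PageRank must be positive")
    return (pagerank - trustrank) / pagerank


def filter_index(scores, threshold=0.9):
    """Keep only the pages whose spam mass is below `threshold`.

    `scores` maps a page id to a (pagerank, trustrank) pair.
    The default cutoff of 0.9 is an illustrative assumption, not a
    value taken from the text.
    """
    return {
        page: (r, t)
        for page, (r, t) in scores.items()
        if spam_mass(r, t) < threshold
    }
```

With hypothetical scores such as {"p": (0.01, 0.0001), "q": (0.02, 0.03)}, page p has spam mass 0.99 and would be dropped, while q has a negative spam mass and would be kept.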
Example 5.12 : Let us consider both the PageRank and topic-sensitive
PageRank that were computed for the graph of Fig. 5.1 in Examples 5.2 and 5.10,
respectively. In the latter case, the teleport set was nodes B and D, so let
us assume those are the trusted pages. Figure 5.17 tabulates the PageRank,
TrustRank, and spam mass for each of the four nodes.
Node   PageRank   TrustRank   Spam Mass
A      3/9        54/210       0.229
B      2/9        59/210      -0.264
C      2/9        38/210       0.186
D      2/9        59/210      -0.264
Figure 5.17: Calculation of spam mass
In this simple example, the only conclusion is that the nodes B and D, which
were a priori determined not to be spam, have negative spam mass and are
therefore not spam. The other two nodes, A and C, each have a positive spam
mass, since their PageRanks are higher than their TrustRanks. For instance,
the spam mass of A is computed by taking the difference 3/9−54/210 = 8/105
and dividing 8/105 by the PageRank 3/9 to get 8/35 or about 0.229. However,
their spam mass is still closer to 0 than to 1, so it is probable that they are not
spam.
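The arithmetic in this example can be checked with exact fractions. The sketch below simply re-enters the PageRank and TrustRank values tabulated in Figure 5.17 (it does not recompute them from the graph) and applies the formula (r−t)/r; the dictionary names are chosen for illustration only.

```python
from fractions import Fraction

# PageRank and TrustRank values copied from Figure 5.17.
pagerank = {
    "A": Fraction(3, 9),
    "B": Fraction(2, 9),
    "C": Fraction(2, 9),
    "D": Fraction(2, 9),
}
trustrank = {
    "A": Fraction(54, 210),
    "B": Fraction(59, 210),
    "C": Fraction(38, 210),
    "D": Fraction(59, 210),
}

# Spam mass of each node: (r - t) / r, kept as an exact fraction.
spam_mass = {
    node: (pagerank[node] - trustrank[node]) / pagerank[node]
    for node in pagerank
}

for node, mass in spam_mass.items():
    print(f"{node}: {mass} = {float(mass):.3f}")
```

For node A this gives 8/35, or about 0.229, matching the hand calculation above; B and D come out negative because their TrustRank exceeds their PageRank.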
5.4.6 Exercises for Section 5.4
Exercise 5.4.1 : In Section 5.4.2 we analyzed the spam farm of Fig. 5.16, where
every supporting page links back to the target page. Repeat the analysis for a