Taxation and Random Surfers : The calculation of PageRank using taxation parameter β can be thought of as giving each random surfer a probability 1 − β of leaving the Web, and introducing an equivalent number of surfers randomly throughout the Web.
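In symbols (a standard formulation; the summary above does not name M, v, e, or n), one iteration step is v' = βMv + (1 − β)e/n, where M is the transition matrix of the Web, v is the current PageRank estimate, e is a vector of all 1s, and n is the number of pages.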
Efficient Representation of Transition Matrices : Since a transition matrix is very sparse (almost all entries are 0),
it saves both time and space to represent it by listing its nonzero entries. However, in addition to being sparse, the
nonzero entries have a special property: they are all the same in any given column; the value of each nonzero entry is
the inverse of the number of nonzero entries in that column. Thus, the preferred representation is column-by-column,
where the representation of a column is the number of nonzero entries, followed by a list of the rows where those
entries occur.
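A minimal Python sketch of this column-by-column encoding (the adjacency-list input format and function name are illustrative, not from the text):

    def encode_columns(out_links):
        # out_links[j] lists the pages that page j links to.
        # Store column j as (degree, rows): every nonzero entry in the
        # column has the value 1/degree, so the values need not be stored.
        return {j: (len(rows), sorted(rows))
                for j, rows in out_links.items() if rows}

    # Example: page 0 links to pages 1 and 2, so column 0 becomes
    # (2, [1, 2]), and both of its entries are implicitly 1/2.
    columns = encode_columns({0: [1, 2], 1: [0], 2: [0, 1]})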
Very Large-Scale Matrix-Vector Multiplication : For Web-sized graphs, it may not be feasible to store the entire PageRank estimate vector in the main memory of one machine. Thus, we can break the vector into k segments and break the transition matrix into k² squares, called blocks, assigning each square to one machine. Each vector segment must be sent to k different machines (one for each block in its stripe of the matrix), so there is a small additional cost in replicating the vector.
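A single-machine simulation of the block scheme, as a sketch (dense NumPy arrays for clarity; in practice each block is stored sparsely on its own machine, and k is assumed to divide n):

    import numpy as np

    def block_multiply(M, v, k):
        # Block (i, j) needs only segment j of v and contributes only to
        # segment i of the result, so each of the k*k machines can work
        # with one block and one vector segment.
        n = len(v)
        s = n // k
        result = np.zeros(n)
        for i in range(k):
            for j in range(k):
                result[i*s:(i+1)*s] += M[i*s:(i+1)*s, j*s:(j+1)*s] @ v[j*s:(j+1)*s]
        return result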
Representing Blocks of a Transition Matrix : When we divide a transition matrix into square blocks, the columns are
divided into k segments. To represent a segment of a column, nothing is needed if there are no nonzero entries in that
segment. However, if there are one or more nonzero entries, then we need to represent the segment of the column by
the total number of nonzero entries in the column (so we can tell what value the nonzero entries have) followed by a
list of the rows with nonzero entries.
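Continuing the sketch above, one way to encode a single block (degree holds each column's total out-degree, since the block alone cannot reveal it; all names are illustrative):

    def encode_block(out_links, degree, row_lo, row_hi, col_lo, col_hi):
        # For each column segment in this block, keep only the destinations
        # that fall in the block's row range. Empty segments are omitted;
        # nonempty ones store the column's TOTAL degree, which is what lets
        # us recover the entry value 1/degree.
        block = {}
        for j in range(col_lo, col_hi):
            rows = [r for r in out_links.get(j, []) if row_lo <= r < row_hi]
            if rows:
                block[j] = (degree[j], rows)
        return block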
Topic-Sensitive PageRank : If we know the querier is interested in a certain topic, then it makes sense to bias the
PageRank in favor of pages on that topic. To compute this form of PageRank, we identify a set of pages known to
be on that topic, and we use it as a “teleport set.” The PageRank calculation is modified so that only the pages in the
teleport set are given a share of the tax, rather than distributing the tax among all pages on the Web.
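In symbols (reusing β and the notation introduced earlier): if S is the teleport set and e_S is the vector with 1 in the components for pages of S and 0 elsewhere, the iteration becomes v' = βMv + (1 − β)e_S/|S|.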
Creating Teleport Sets : For topic-sensitive PageRank to work, we need to identify pages that are very likely to be
about a given topic. One approach is to start with the pages that the open directory (DMOZ) identifies with that topic. Another is to identify words known to be associated with the topic, and select for the teleport set those pages that
have an unusually high number of occurrences of such words.
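A sketch of the word-based approach (the threshold and all names here are placeholders; a real system would compare against background word frequencies rather than use a fixed cutoff):

    def teleport_set(page_words, topic_words, threshold=0.01):
        # page_words maps a page id to the list of words on that page;
        # topic_words is the set of words associated with the topic.
        # Select pages where topic words form an unusually large
        # fraction of the text.
        selected = set()
        for page, words in page_words.items():
            if words and sum(w in topic_words for w in words) / len(words) > threshold:
                selected.add(page)
        return selected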
Link Spam : To fool the PageRank algorithm, unscrupulous actors have created spam farms. These are collections of
pages whose purpose is to concentrate high PageRank on a particular target page.
Structure of a Spam Farm : Typically, a spam farm consists of a target page and very many supporting pages. The
target page links to all the supporting pages, and the supporting pages link only to the target page. In addition, it is
essential that some links from outside the spam farm be created. For example, the spammer might introduce links to
their target page by writing comments in other people's blogs or discussion groups.
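A standard back-of-the-envelope analysis (using the taxation parameter β from above; x, y, m, and n are notation introduced here, not in the summary) shows why this structure works. Let y be the PageRank of the target page, x the PageRank it receives from links outside the farm, m the number of supporting pages, and n the total number of pages on the Web. Each supporting page receives βy/m from the target plus its (1 − β)/n share of the tax, so the target receives y = x + βm(βy/m + (1 − β)/n) = x + β²y + β(1 − β)m/n. Solving gives y = x/(1 − β²) + (β/(1 + β))(m/n): with β = 0.85, the external contribution x is amplified by 1/(1 − β²) ≈ 3.6, and each supporting page contributes about 0.46/n more.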
TrustRank : One way to ameliorate the effect of link spam is to compute a topic-sensitive PageRank called
TrustRank, where the teleport set is a collection of trusted pages. For example, the home pages of universities could
serve as the trusted set. This technique avoids sharing the tax in the PageRank calculation with the large numbers
of supporting pages in spam farms and thus preferentially reduces their PageRank.
Spam Mass : To identify spam farms, we can compute both the conventional PageRank and the TrustRank for all
pages. Those pages that have much lower TrustRank than PageRank are likely to be part of a spam farm.
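A concrete measure for this comparison is the spam mass: for a page with PageRank r and TrustRank t, the spam mass is (r − t)/r. A spam mass close to 1 means almost none of the page's PageRank is explained by trusted sources, so the page can be treated as spam; a small or negative spam mass suggests the page is legitimate.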
Hubs and Authorities : While PageRank gives a one-dimensional view of the importance of pages, an algorithm
called HITS tries to measure two different aspects of importance. Authorities are those pages that contain valuable
information. Hubs are pages that, while they do not themselves contain the information, link to places where the
information can be found.
Recursive Formulation of the HITS Algorithm : Calculation of the hubs and authorities scores for pages depends on
solving the recursive equations: “a hub links to many authorities, and an authority is linked to by many hubs.” The
solution to these equations is essentially an iterated matrix-vector multiplication, just like PageRank's. However,
the existence of dead ends or spider traps does not affect the solution to the HITS equations in the way it does for PageRank, so no taxation scheme is necessary.
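A minimal sketch of the HITS iteration (dense NumPy for clarity; L is the link matrix with L[i, j] = 1 if page i links to page j, and scaling each vector so its largest component is 1 stands in for the unknown constants in the recursive equations):

    import numpy as np

    def hits(L, iterations=50):
        # h = L a: a hub scores well by linking to good authorities.
        # a = L^T h: an authority scores well by being linked to by good hubs.
        n = L.shape[0]
        h = np.ones(n)
        for _ in range(iterations):
            a = L.T @ h
            a = a / a.max()
            h = L @ a
            h = h / h.max()
        return h, a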