Taxation and Random Surfers : The calculation of PageRank using taxation parameter β can be thought of as giving each random surfer a probability 1 − β of leaving the Web, and introducing an equivalent number of surfers randomly throughout the Web.
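In symbols (a standard formulation; the summary above does not name M, v, e, or n), one iteration step is v' = βMv + (1 − β)e/n, where M is the transition matrix of the Web, v is the current PageRank estimate, e is a vector of all 1s, and n is the number of pages.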
Efficient Representation of Transition Matrices : Since a transition matrix is very sparse (almost all entries are 0),
it saves both time and space to represent it by listing its nonzero entries. However, in addition to being sparse, the
nonzero entries have a special property: they are all the same in any given column; the value of each nonzero entry is
the inverse of the number of nonzero entries in that column. Thus, the preferred representation is column-by-column,
where the representation of a column is the number of nonzero entries, followed by a list of the rows where those
entries occur.
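A minimal Python sketch of this column-by-column encoding (the adjacency-list input format and function name are illustrative, not from the text):

    def encode_columns(out_links):
        # out_links[j] lists the pages that page j links to.
        # Store column j as (degree, rows): every nonzero entry in the
        # column has the value 1/degree, so the values need not be stored.
        return {j: (len(rows), sorted(rows))
                for j, rows in out_links.items() if rows}

    # Example: page 0 links to pages 1 and 2, so column 0 becomes
    # (2, [1, 2]), and both of its entries are implicitly 1/2.
    columns = encode_columns({0: [1, 2], 1: [0], 2: [0, 1]})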
Very Large-Scale Matrix-Vector Multiplication : For Web-sized graphs, it may not be feasible to store the entire PageRank estimate vector in the main memory of one machine. Thus, we can break the vector into k segments and break the transition matrix into k² squares, called blocks, assigning each square to one machine. Each vector segment must be sent to k different machines (one for each block in its stripe of the matrix), so there is a small additional cost in replicating the vector.
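A single-machine simulation of the block scheme, as a sketch (dense NumPy arrays for clarity; in practice each block is stored sparsely on its own machine, and k is assumed to divide n):

    import numpy as np

    def block_multiply(M, v, k):
        # Block (i, j) needs only segment j of v and contributes only to
        # segment i of the result, so each of the k*k machines can work
        # with one block and one vector segment.
        n = len(v)
        s = n // k
        result = np.zeros(n)
        for i in range(k):
            for j in range(k):
                result[i*s:(i+1)*s] += M[i*s:(i+1)*s, j*s:(j+1)*s] @ v[j*s:(j+1)*s]
        return result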
Representing Blocks of a Transition Matrix : When we divide a transition matrix into square blocks, the columns are
divided into k segments. To represent a segment of a column, nothing is needed if there are no nonzero entries in that
segment. However, if there are one or more nonzero entries, then we need to represent the segment of the column by
the total number of nonzero entries in the column (so we can tell what value the nonzero entries have) followed by a
list of the rows with nonzero entries.
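Continuing the sketch above, one way to encode a single block (degree holds each column's total out-degree, since the block alone cannot reveal it; all names are illustrative):

    def encode_block(out_links, degree, row_lo, row_hi, col_lo, col_hi):
        # For each column segment in this block, keep only the destinations
        # that fall in the block's row range. Empty segments are omitted;
        # nonempty ones store the column's TOTAL degree, which is what lets
        # us recover the entry value 1/degree.
        block = {}
        for j in range(col_lo, col_hi):
            rows = [r for r in out_links.get(j, []) if row_lo <= r < row_hi]
            if rows:
                block[j] = (degree[j], rows)
        return block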
Topic-Sensitive PageRank : If we know the querier is interested in a certain topic, then it makes sense to bias the
PageRank in favor of pages on that topic. To compute this form of PageRank, we identify a set of pages known to
be on that topic, and we use it as a “teleport set.” The PageRank calculation is modified so that only the pages in the
teleport set are given a share of the tax, rather than distributing the tax among all pages on the Web.
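In symbols (reusing β and the notation introduced earlier): if S is the teleport set and e_S is the vector with 1 in the components for pages of S and 0 elsewhere, the iteration becomes v' = βMv + (1 − β)e_S/|S|.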
Creating Teleport Sets : For topic-sensitive PageRank to work, we need to identify pages that are very likely to be
about a given topic. One approach is to start with the pages that the open directory (DMOZ) identifies with that topic. Another is to identify words known to be associated with the topic, and select for the teleport set those pages that
have an unusually high number of occurrences of such words.
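A sketch of the word-based approach (the threshold and all names here are placeholders; a real system would compare against background word frequencies rather than use a fixed cutoff):

    def teleport_set(page_words, topic_words, threshold=0.01):
        # page_words maps a page id to the list of words on that page;
        # topic_words is the set of words associated with the topic.
        # Select pages where topic words form an unusually large
        # fraction of the text.
        selected = set()
        for page, words in page_words.items():
            if words and sum(w in topic_words for w in words) / len(words) > threshold:
                selected.add(page)
        return selected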
Link Spam : To fool the PageRank algorithm, unscrupulous actors have created spam farms. These are collections of
pages whose purpose is to concentrate high PageRank on a particular target page.
Structure of a Spam Farm : Typically, a spam farm consists of a target page and very many supporting pages. The
target page links to all the supporting pages, and the supporting pages link only to the target page. In addition, it is
essential that some links from outside the spam farm be created. For example, the spammer might introduce links to
their target page by writing comments in other people's blogs or discussion groups.
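A standard back-of-the-envelope analysis (using the taxation parameter β from above; x, y, m, and n are notation introduced here, not in the summary) shows why this structure works. Let y be the PageRank of the target page, x the PageRank it receives from links outside the farm, m the number of supporting pages, and n the total number of pages on the Web. Each supporting page receives βy/m from the target plus its (1 − β)/n share of the tax, so the target receives y = x + βm(βy/m + (1 − β)/n) = x + β²y + β(1 − β)m/n. Solving gives y = x/(1 − β²) + (β/(1 + β))(m/n): with β = 0.85, the external contribution x is amplified by 1/(1 − β²) ≈ 3.6, and each supporting page contributes about 0.46/n more.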
TrustRank : One way to ameliorate the effect of link spam is to compute a topic-sensitive PageRank called
TrustRank, where the teleport set is a collection of trusted pages. For example, the home pages of universities could
serve as the trusted set. This technique avoids sharing the tax in the PageRank calculation with the large numbers
of supporting pages in spam farms and thus preferentially reduces their PageRank.
Spam Mass : To identify spam farms, we can compute both the conventional PageRank and the TrustRank for all
pages. Those pages that have much lower TrustRank than PageRank are likely to be part of a spam farm.
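A concrete measure for this comparison is the spam mass: for a page with PageRank r and TrustRank t, the spam mass is (r − t)/r. A spam mass close to 1 means almost none of the page's PageRank is explained by trusted sources, so the page can be treated as spam; a small or negative spam mass suggests the page is legitimate.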
Hubs and Authorities : While PageRank gives a one-dimensional view of the importance of pages, an algorithm
called HITS tries to measure two different aspects of importance. Authorities are those pages that contain valuable
information. Hubs are pages that, while they do not themselves contain the information, link to places where the
information can be found.
Recursive Formulation of the HITS Algorithm : Calculation of the hubs and authorities scores for pages depends on
solving the recursive equations: “a hub links to many authorities, and an authority is linked to by many hubs.” The
solution to these equations is essentially an iterated matrix-vector multiplication, just like PageRank's. However,
the existence of dead ends or spider traps does not affect the solution to the HITS equations in the way it does for PageRank, so no taxation scheme is necessary.
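A minimal sketch of the HITS iteration (dense NumPy for clarity; L is the link matrix with L[i, j] = 1 if page i links to page j, and scaling each vector so its largest component is 1 stands in for the unknown constants in the recursive equations):

    import numpy as np

    def hits(L, iterations=50):
        # h = L a: a hub scores well by linking to good authorities.
        # a = L^T h: an authority scores well by being linked to by good hubs.
        n = L.shape[0]
        h = np.ones(n)
        for _ in range(iterations):
            a = L.T @ h
            a = a / a.max()
            h = L @ a
            h = h / h.max()
        return h, a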