Link Analysis - Mining of Massive Datasets

Databases Reference

In-Depth Information

F

Dead Ends: A dead end is a Web page with no links out. The presence of

dead ends will cause the PageRank of some or all of the pages to go to 0

in the iterative computation, including pages that are not dead ends. We

can eliminate all dead ends before undertaking a PageRank calculation

by recursively dropping nodes with no arcs out. Note that dropping one

node can cause another, which linked only to it, to become a dead end,

so the process must be recursive.

F

Spider Traps: A spider trap is a set of nodes that, while they may link to

each other, have no links out to other nodes. In an iterative calculation

of PageRank, the presence of spider traps cause all the PageRank to be

captured within that set of nodes.

F

Taxation Schemes: To counter the effect of spider traps (and of dead ends,

if we do not eliminate them), PageRank is normally computed in a way

that modifies the simple iterative multiplication by the transition matrix.

A parameter β is chosen, typically around 0.85. Given an estimate of the

PageRank, the next estimate is computed by multiplying the estimate by

β times the transition matrix, and then adding (1−β)/n to the estimate

for each page, where n is the total number of pages.

F

Taxation and Random Surfers: The calculation of PageRank using taxa-

tion parameter β can be thought of as giving each random surfer a prob-

ability 1−β of leaving the Web, and introducing an equivalent number

of surfers randomly throughout the Web.

F

E cient Representation of Transition Matrices: Since a transition matrix

is very sparse (almost all entries are 0), it saves both time and space

to represent it by listing its nonzero entries. However, in addition to

being sparse, the nonzero entries have a special property: they are all the

same in any given column; the value of each nonzero entry is the inverse

of the number of nonzero entries in that column. Thus, the preferred

representation is column-by-column, where the representation of a column

is the number of nonzero entries, followed by a list of the rows where those

entries occur.

F

Very Large-Scale Matrix-Vector Multiplication: For Web-sized graphs, it

may not be feasible to store the entire PageRank estimate vector in the

main memory of one machine. Thus, we can break the vector into k

segments and break the transition matrix into k 2 squares, called blocks,

assigning each square to one machine. The vector segments are each sent

to k machines, so there is a small additional cost in replicating the vector.

F

Representing Blocks of a Transition Matrix : When we divide a transition

matrix into square blocks, the columns are divided into k segments. To

represent a segment of a column, nothing is needed if there are no nonzero

entries in that segment. However, if there are one or more nonzero entries,

Mining of Massive Datasets

Search WWH ::

Custom Search

Home