Database Reference
In-Depth Information
The input is a text file with two URLs per line: a page and an outbound link from that
page. For example, the following file encodes the fact that
A
has links to
B
and
C
, and
B
has a link to
D
:
www.A.com www.B.com
www.A.com www.C.com
www.B.com www.D.com
Going back to the code, the first line reads the input and computes an initial
PageRankData
object for each unique page.
PageRankData
is a simple Java class
with fields for the score, the previous score (this will be used to check for convergence),
and a list of outbound links:
public static class
PageRankData
{
public
float
score
;
public
float
lastScore
;
public
List
<
String
>
urls
;
// ... methods elided
}
The goal of the algorithm is to compute a score for each page, representing its relative im-
portance. All pages start out equal, so the initial score is set to be 1 for each page, and the
previous score is 0. Creating the list of outbound links is achieved using the Crunch oper-
ations of grouping the input by the first field (page), then aggregating the values (out-
The iteration is carried out using a regular Java
while
loop. The scores are updated in
each iteration of the loop by calling the
pageRank()
method, which encapsulates the
PageRank algorithm as a series of Crunch operations. If the delta between the last set of
scores and the new set of scores is below a small enough value (0.01), then the scores
have converged and the algorithm terminates. The delta is computed by the
com-
puteDelta()
method, a Crunch aggregation that finds the largest absolute difference in
page score for all the pages in the collection.
So when is the pipeline run? The answer is each time
pDelta.getValue()
is called.
The first time through the loop, no
PCollection
s have been materialized yet, so the
jobs for
readUrls()
,
pageRank()
, and
computeDelta()
must be run in order to
compute
delta
. On subsequent iterations only the jobs to compute the new scores
(
pageRank()
) and
delta
(
computeDelta()
) need be run.