Database Reference
In-Depth Information
The input is a text file with two URLs per line: a page and an outbound link from that
page. For example, the following file encodes the fact that A has links to B and C , and B
has a link to D :
www.A.com www.B.com
www.A.com www.C.com
www.B.com www.D.com
Going back to the code, the first line reads the input and computes an initial
PageRankData object for each unique page. PageRankData is a simple Java class
with fields for the score, the previous score (this will be used to check for convergence),
and a list of outbound links:
public static class PageRankData {
public float score ;
public float lastScore ;
public List < String > urls ;
// ... methods elided
}
The goal of the algorithm is to compute a score for each page, representing its relative im-
portance. All pages start out equal, so the initial score is set to be 1 for each page, and the
previous score is 0. Creating the list of outbound links is achieved using the Crunch oper-
ations of grouping the input by the first field (page), then aggregating the values (out-
bound links) into a list. [ 127 ]
The iteration is carried out using a regular Java while loop. The scores are updated in
each iteration of the loop by calling the pageRank() method, which encapsulates the
PageRank algorithm as a series of Crunch operations. If the delta between the last set of
scores and the new set of scores is below a small enough value (0.01), then the scores
have converged and the algorithm terminates. The delta is computed by the com-
puteDelta() method, a Crunch aggregation that finds the largest absolute difference in
page score for all the pages in the collection.
So when is the pipeline run? The answer is each time pDelta.getValue() is called.
The first time through the loop, no PCollection s have been materialized yet, so the
jobs for readUrls() , pageRank() , and computeDelta() must be run in order to
compute delta . On subsequent iterations only the jobs to compute the new scores
( pageRank() ) and delta ( computeDelta() ) need be run.
Search WWH ::




Custom Search